Abstract

What can a social network tell us about the underlying latent "social structure," the way
in which individuals are similar or dissimilar? Much of social network analysis is, implicitly or 
explicitly, predicated on the assumption that individuals tend to be more similar to their friends 
than to strangers. Having explicit access to similarity information instead of merely the noisy 
signal provided by the presence or absence of edges could improve analysis significantly. We 
study the following natural question: Given a social network — reflecting the underlying social 
distances between its nodes — how accurately can we reconstruct the social structure? 

It is tempting to model similarities and dissimilarities as distances, so that the social struc- 
ture is a metric space. However, observed social networks are usually multiplex, in the sense 
that different edges originate from similarity in one or more among a number of different cate- 
gories, such as geographical proximity, professional interactions, kinship, or hobbies. Since close 
proximity in even one category makes the presence of edges much more likely, an observed social 
network is more accurately modeled as a union of separate networks. In general, it is a priori 
not known which network a given edge comes from. While generative models based on a single 
metric space have been analyzed previously, a union of several networks individually generated 
from metrics is structurally very different from (and more complex than) networks generated 
from just one metric. 

In this paper, we begin to address this reconstruction problem formally. The latent "social 
structure" consists of several metric spaces. Each metric space gives rise to a "distance-based 
random graph," in which edges are created according to a distribution that depends on the 
underlying metric space and makes long-range edges less likely than short ones. For a concrete 
model, we consider Kleinberg's small-world model and some variations thereof. The observed 
social network is the union of these graphs. All edges are unlabeled, in the sense that the 
existence of an edge does not reveal which random graph it comes from. Our main result is a 
near-linear time algorithm which reconstructs from this unlabeled union each of the individual 
metrics with provably low distortion. 
1 Introduction



Much of social network analysis is, implicitly or explicitly, predicated on the assumption that people 
tend to be more similar to their friends than to strangers. While many tasks — such as analyzing 
power and centrality, trading and exchange, or understanding and influencing the diffusion of 
viruses or information — rely crucially on the precise network structure, many others — such as 
link prediction, identification of communities, or marketing to friends of past buyers — use network 
structure as a noisy signal about an underlying social similarity space. To illustrate this insight 
differently, consider altering a social network data set by removing links between "dissimilar" pairs 
of individuals, and inserting instead links between "similar" (but previously unconnected) pairs. If 
this change makes the analysis task easier, rather than impossible, then the analysis task is really 
about the "social structure" — the latent similarities and dissimilarities between individuals — 
rather than about the actual network structure. 

Given the abundance of important problems naturally phrased in terms of social structure 
(discussed in more detail below), it is a natural goal to explicitly reconstruct social structures from 
a given social network. Knowing the social structure may also be of independent interest, as it 
sheds light on the forces governing social link formation. 

The task of inferring social structure in this sense is made non-trivial by the following two 
obstacles. First, despite a general tendency for friends to be more similar than strangers, many 
friends are still sufficiently different from each other to look essentially random. Second, and 
perhaps more fundamentally, social networks are multiplex [191 EHl E2] : they tend to be the union 
of multiple often independent relations among the same actors. For instance, friendships could 
result from physical proximity, similarity of occupation, kinship, similarities of hobbies, etc. If 
individuals are very similar in even one such attribute, they are more likely to be connected. 

The main contribution of this paper is a near-linear time algorithm for reconstructing the latent 
social structure with provably low distortion. The model explicitly produces a union of graphs, 
one for each category, and an important feature of the algorithm is that it separates the different 
graphs from each other. We also provide two extensions which, respectively, further improve the 
distortion, and partially address the issue of data scarcity (i.e., very small node degrees). The 
algorithms in this paper are based on, and significant extensions of, a natural idea that is widely 
used in practice: nodes are likely to be close if they share many common neighbors. 

1.1 An overview of the model 

We posit a latent space model (described in detail in Section 3) for the generation of social networks
akin to models widely used in the mathematical sociology, statistics, and computer science
communities [IllSlllMllMllMlllQlliaESllMlEaEH] (see also the survey [ZH pages 15-21]). 

The model is based on two widely accepted tenets about social networks (e.g., [9, 56]). First,
people are more likely to have ties with those who are similar to them, but also have many ties
to others who are dissimilar.^1 Second, multiple social dimensions (such as geography, occupation,
kinship, hobbies, etc.) can independently lead to interactions and the formation of ties. 

We call the social dimensions along which people can be (dis)similar (social) categories, to avoid 
confusion with the geometric dimensions of individual metric spaces. Each category is given by a 
metric space $\mathcal{D}_i$, $i = 1, \ldots, K$; together, the $\mathcal{D}_i$ define the social distances between the individuals.

^1 The model is agnostic about whether this similarity is caused more by homophily [47, 57] (the tendency to form
ties with those who are similar) or by social influence [55, 63] (the tendency to become similar to one's associates).
Each of the n individuals occupies a point in each of the categories. For concreteness, and in
accordance with much of the preceding literature, we assume that each category is a Euclidean space
of known dimensionality [331 EH HOI HSl [Ml [M], and that the density of the points corresponding to
individuals is nearly uniform [34, 40, 64]. Furthermore, we assume that the categories have small
local correlation. The "local correlation" of two categories is the maximal overlap between any two
small balls in those categories (see the formal definition in Section 3).

Each category independently gives rise to a social network $\mathcal{G}_i$, modeled as a random graph
whose edge distribution is parameterized by the corresponding metric space $\mathcal{D}_i$. Specifically, we
use a slight variation of Kleinberg's small-world model [30], in which edge probabilities decrease
polynomially in $\mathcal{D}_i(u,v)$. For our purposes, the key feature of the model is that the probability of
shorter links is much higher, but long-range links also appear with a significant probability; this
captures the first tenet. The algorithm observes the union $\mathcal{G} = \bigcup_i \mathcal{G}_i$ of the individual networks
$\mathcal{G}_i$ (on the same node set), but does not learn which particular network(s) $\mathcal{G}_i$ an edge belonged
to. This captures the second tenet; only the existence, but not the social "origins," of ties can
be observed.^2 The algorithm's goal is to use $\mathcal{G}$ to reconstruct the individual metrics $\mathcal{D}_i$ with small
distortion, with high probability (over the random network generation process).
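
The generative process described above can be illustrated with a minimal Python sketch. It is not the paper's exact model: the placement in the unit cube, the exponent, the scale constant, and all function names (sample_points, category_graph, multiplex_network) are assumptions made for illustration. Each category places the same n nodes independently, edges are sampled with a probability that decreases polynomially in the distance, and only the unlabeled union is exposed as the observed network.

```python
import itertools
import math
import random


def sample_points(n, dim, rng):
    """Place n individuals uniformly at random in the unit cube [0, 1]^dim."""
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]


def category_graph(points, exponent, scale, rng):
    """Sample one distance-based random graph.

    The edge (u, v) appears independently with probability
    min(1, scale / dist(u, v) ** exponent), an assumed inverse polynomial law.
    """
    edges = set()
    for u, v in itertools.combinations(range(len(points)), 2):
        denom = math.dist(points[u], points[v]) ** exponent
        prob = 1.0 if denom == 0 else min(1.0, scale / denom)
        if rng.random() < prob:
            edges.add((u, v))
    return edges


def multiplex_network(n, num_categories, dim=2, exponent=2.0, scale=0.01, seed=0):
    """Return per-category positions, per-category graphs, and their unlabeled union."""
    rng = random.Random(seed)
    categories = [sample_points(n, dim, rng) for _ in range(num_categories)]
    graphs = [category_graph(points, exponent, scale, rng) for points in categories]
    # Taking the set union drops the information about which graph an edge came from.
    observed = set().union(*graphs)
    return categories, graphs, observed
```

A reconstruction algorithm in this setting would receive only `observed`; the positions and the per-category graphs play the role of the hidden ground truth.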

Importantly, social similarity spaces in general tend not to be metrics (see, e.g., [H]), in the 
sense that the triangle inequality fails to hold. The main reason is the presence of multiple social 
categories. For example, one's co-worker and one's relative could be very dissimilar to one another, 
even though the individual is similar to both. The inclusion of a union or minimum in the model 
is crucial to capture this. 
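
The co-worker and relative example can be made concrete with a small Python sketch. The coordinates and names below are hypothetical; the only modeling ingredient taken from the text is that the combined social distance is the minimum over categories.

```python
import math

# Hypothetical one-dimensional positions in two categories.
work = {"ego": (0.0,), "coworker": (0.1,), "relative": (5.0,)}
family = {"ego": (0.0,), "coworker": (7.0,), "relative": (0.2,)}


def social_distance(a, b):
    """Minimum over the two categories of the Euclidean distance between a and b."""
    return min(math.dist(work[a], work[b]), math.dist(family[a], family[b]))


# The ego is close to both others (via different categories), yet the
# co-worker and the relative are far apart in every category.
direct = social_distance("coworker", "relative")
via_ego = social_distance("coworker", "ego") + social_distance("ego", "relative")
triangle_inequality_holds = direct <= via_ego
```

Each category on its own is a metric, but in this instance the minimum is not: the direct distance between co-worker and relative exceeds the detour through the ego.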

1.2 Algorithms and results 

Our main contribution is a near-linear time algorithm, called the Amoeba algorithm, which infers 
all individual categories with provably low distortion, with high probability. The following theorem 
captures the result slightly informally. 

Theorem 1.1 (informal). If the K metric spaces $\mathcal{D}_i$ are locally sufficiently different, and the
average node degrees are at least a sufficiently large polynomial in $K$ and $\log n$, then with high probability, the Amoeba algorithm, in
near-linear time, reconstructs metrics $\mathcal{D}'_i$ such that $\mathcal{D}'_i$ approximates $\mathcal{D}_i$ with constant multiplicative
distortion (and at most polylogarithmic additive error).

That this approximate reconstruction should be possible at all — regardless of the running 
time — is somewhat surprising. One might think a priori that after combining two social networks, 
there would simply be no way to tease them apart. 

In other words, a priori, the challenge appears to be information-theoretical (does the network 
contain enough information for distance reconstruction with any provable guarantees?) as much 
as computational. We also remark that even the single-category version was raised by Kleinberg 
[42] as an open question; we answer the reconstruction question in the positive even for multiple
categories. 

The Amoeba algorithm, as well as all other algorithms in this paper, is broadly based on a
heuristic widely used in practice (e.g., in Facebook, or see [H [SH [631 [67]): edges (u,v) are more
likely to be between friends in a category if they are "supported" by many common neighbors

^2 Our model does not include any information such as demographics, location, wall posts, or communications which
would frequently be available to social networking sites [5]. Our goal here is to understand at a fundamental level
how much information on social structures can be inferred algorithmically from the observed social network alone. 
of u and v in that category. However, to deal with multiple categories, low node degrees, or to
sharpen the distance estimates, the basic idea of counting common neighbors needs to be extended 
significantly. 

The Amoeba algorithm, presented and analyzed in detail in Section 4, consists of two stages.
In a first stage, individual edges are pruned if they do not have enough common neighbors, a direct
implementation of the common neighbors heuristic.^3 In the second stage, which we call the Amoeba
stage, basic estimates of the individual categories are constructed one by one. Each iteration starts
with a polylog-sized clique in the graph computed by the first stage, which is then expanded one
edge at a time: an edge (u, v) is added to a category only when enough of u's neighbors lie in
a small ball around v according to the current estimate of the category. The basic idea is that
any sufficiently large clique must be sufficiently close in one category. The clique then bootstraps
further iterations, in that a node u with many edges to a small ball around v must itself be close to
v. While this intuition is straightforward, each iteration loses accuracy, so it takes a delicate proof
to show that this refined version of the common neighbors heuristic guarantees low distortion. 
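
The first (pruning) stage can be sketched in a few lines of Python. This is only an illustration of the common neighbors heuristic: the function names and the `threshold` parameter are hypothetical, and the threshold that the analysis actually requires is not specified in this overview.

```python
from collections import defaultdict


def adjacency(edges):
    """Build an undirected adjacency map from a collection of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def prune_by_common_neighbors(edges, threshold):
    """Keep only the edges whose endpoints share at least `threshold` neighbors."""
    adj = adjacency(edges)
    return {(u, v) for u, v in edges if len(adj[u] & adj[v]) >= threshold}
```

For example, in a 4-clique with one extra pendant edge, every clique edge is supported by two common neighbors while the pendant edge has none, so a threshold of 2 removes exactly the pendant edge.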

We improve the main result in the following two directions. The first direction (Sections 5 and 6)
focuses on improving the distortion using long-range links, which are now treated as an additional
data source rather than an obstacle to be pruned. We improve the distortion from a multiplicative
constant to a factor 1 + o(1), using a post-processing phase (run after the Amoeba algorithm)
which we call the Two-Ball Algorithm. This is a variation of the common neighbors heuristic where
instead of common neighbors of two nodes (u,v), the algorithm counts long-range links between
two node sets. The node sets are low-radius balls around u and v according to the initial distance
estimates. This result requires a stronger notion of low correlation between categories. Under
stronger uniform density conditions, the Two-Ball Algorithm can be applied recursively, yielding
unit distortion (with at most polylogarithmic additive error).
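
The counting step at the core of the Two-Ball idea can be sketched as follows. This is a simplified illustration, not the algorithm of Sections 5 and 6: the nested-dictionary representation `estimate[a][b]` of the initial distance estimates, the function names, and the `radius` parameter are assumptions, and the conversion from a count to a refined distance estimate is omitted.

```python
def ball(center, radius, estimate):
    """Nodes within `radius` of `center` under the current distance estimate."""
    return {w for w, dist in estimate[center].items() if dist <= radius}


def two_ball_count(u, v, radius, estimate, edges):
    """Count observed edges with one endpoint near u and the other near v."""
    ball_u = ball(u, radius, estimate)
    ball_v = ball(v, radius, estimate)
    return sum(
        1
        for a, b in edges
        if (a in ball_u and b in ball_v) or (a in ball_v and b in ball_u)
    )
```

Because the count aggregates over many node pairs rather than over the common neighbors of a single pair, it remains informative even when u and v themselves share no neighbors.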

Second (in Section 7), we deal with the issue of data scarcity, which in our setting translates
to low node degrees. In the low (constant) node degree regime, the common neighbors heuristic 
is uninformative, and it instead becomes necessary to count disjoint constant-length paths for a 
suitably chosen constant. Combining the new initial pruning phase with a subsequent Two-Ball 
Algorithm requires a much more careful analysis, which shows that all sufficiently long edges can 
be treated as mutually independent given the pruned graph. We recover (essentially) all our results 
for the single-category case; extending the results to multiple categories remains a direction for 
future work. 

For both extensions, more detailed descriptions of challenges, results, and high-level approaches 
are deferred to the introductory portions of the corresponding sections. 

Our algorithms are modular: a pre-processing step (counting common neighbors, or the low-degree
algorithm of Section 7) prunes away very long edges. The Amoeba step separates different
metrics and constructs initial distance estimates (though we have not adapted the algorithm and 
analysis to low node degrees). Finally, the Two-Ball Algorithm and its recursive version can be 
used to further improve the distortion in individual categories. 

^3 Sarkar et al. [64] showed that under a model similar to ours (but using edge probabilities that decrease exponentially
with distance), counting common neighbors leads to an accurate distance estimate for a single-category social
network.
1.3 Discussion of the model



Our modeling goal is not to define a model of social networks capturing all of their features; this 
would be a formidable/impossible task for which there is much research but not much consensus. 
Instead, we aim for generally accepted modeling choices which capture in a clean way the main 
algorithmic challenges inherent in rigorous distance reconstruction. In particular, our main goal 
was to capture the two conceptual obstacles to distance reconstruction: links between dissimilar 
individuals, and multiple social categories. Nevertheless, we discuss some particular modeling 
choices in more detail. 

1. In Kleinberg's small-world model [40^ [39} Wl\ [l2l I22|, a version of which we adapt as a
generative model for individual categories, the probability for an edge between two nodes to
exist decreases polynomially in the nodes' distance. Naturally, many other distributions lead
to distance-based random graphs [8].

Much of the past work in the statistics community [33, 34, 45, 62, 64] assumed that the
edge probabilities were logit-linear in the distance, i.e., that $\log\frac{p}{1-p}$ is linear in $\mathcal{D}(u,v)$.
Since long-range links are thus exponentially unlikely ($p = \frac{1}{1 + e^{\alpha + \beta \mathcal{D}(u,v)}}$ for constants $\alpha$ and $\beta > 0$), the reconstruction
task becomes much easier. More importantly, to the extent that precise distributions have
been empirically tested, remarkable fits have been found [2, 5, 52] with Kleinberg's inverse
polynomial distribution [40, 41].^4 Furthermore, our main constant-distortion result holds for
a much more general class of distributions, including logit-linear distributions.

2. The choice of Euclidean spaces with near-uniform density. Both choices (Euclidean and near-uniform)
are ubiquitous in past work^5 [29l [331 [Ml [Ml [101 [151 [621 [M], and are made mostly
for technical convenience; they allow us to separate the conceptual difficulty of teasing apart
different metrics and inferring distances with low distortion from the technical difficulty of
dealing with arbitrary metric spaces. We believe that future work will achieve similar results
for more general metric spaces or related structures, in particular, ultrametrics [14, 41, 68],
which are another popular choice of latent metric spaces.

3. The choice of a union or minimum to combine individual metrics. This choice is clearly a 
simplification of reality: individuals are more likely to form ties if they share similarities 
in multiple dimensions, e.g., they work in the same field and live in the same town. Our 
model is supposed to capture in the cleanest way the difficulty of separating edges originating 
from different categories, and is certainly a better approximation to reality than widely used 
models treating the social structure as one metric space. 

Our model is closely related to (and a slight generalization of) a notion of social distance 
proposed by Watts, Dodds, and Newman [76], which treats the social distance as the minimum 
of distances in multiple metrics. To the extent that past work explicitly discussed models 
of multiple categories, it was also based on the minimum [33, pp. 337, 348], (Ml p. 335]. A
generalization to more realistic models is a natural direction for future work. 

^4 However, links that appear long could plausibly be short in another metric; whether inverse polynomial distributions
remain prevalent when multiple metrics are considered is an interesting, although difficult, direction for
future empirical work.

^5 In many respects, our kind of latent space models deteriorate if node densities can be highly non-uniform [28].
4. We capture a notion of "independence" between categories by requiring that small balls in
different categories have small overlap. Even without restrictions on computational resources, 
some assumption about "independence" is clearly necessary: if categories could be extremely 
similar, then no low-distortion reconstruction seems possible. It is an interesting direction for 
future work whether a few isolated violations of the condition permit low-distortion recon- 
struction in all but the affected areas of the metric spaces. 

Our condition is significantly weaker than requiring probabilistic independence. Several past 
papers (using a single metric space) assumed that nodes were placed independently and 
uniformly at random over some space [34, 64]; such a model of individual categories would
imply our "small intersection" condition with high probability. In fact, we show in Section
8 that with high probability, the "small intersection" condition holds even when nodes are
placed adversarially, and their names are permuted randomly. We also remark that while in 
reality, we will frequently observe high correlation between "categories" (such as work and 
hobbies), this could be construed as a sign that the categories should be chosen differently, in 
order to represent the latent traits that manifest themselves in choices of both occupations 
and hobbies. 
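
The contrast drawn in item 1 between inverse polynomial and logit-linear edge distributions can be checked numerically with a short Python sketch. The parameter names (`exponent`, `scale`, `alpha`, `beta`) and their default values are assumptions chosen for illustration; the functions only encode the two functional forms named in the text.

```python
import math


def inverse_polynomial(d, exponent=2.0, scale=1.0):
    """Edge probability that decays polynomially in the distance d > 0."""
    return min(1.0, scale / d ** exponent)


def logit_linear(d, alpha=0.0, beta=1.0):
    """Edge probability p with log(p / (1 - p)) = -(alpha + beta * d)."""
    return 1.0 / (1.0 + math.exp(alpha + beta * d))


# At distance 100 the polynomial law still gives a non-negligible probability,
# while the logit-linear law is smaller by dozens of orders of magnitude.
ratio_at_100 = inverse_polynomial(100.0) / logit_linear(100.0)
```

Under the logit-linear form, long-range edges are so rare that an observed edge is almost always short; under the polynomial form, a reconstruction algorithm has to cope with many long edges.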

1.4 Applications 

Our work provides two natural reconstruction abilities: separating edges by categories, and recon- 
structing individual categories with low distortion. Both of them have multiple useful applications. 

Important industrial applications for social network information include improving ad placement
(social advertising), web search results (social search), and product recommendations. These
applications are of vital importance for some of the major players on the Web. A key commonality 
of all three applications is that they use the behavior of friends (clicking, searching, purchasing) to 
predict the behavior of an individual. Yet, two recent studies [3H [53] undertaking a quantitative 
evaluation of the predictive power of social links for purchases and click behavior have found at 
best mixed evidence. 

This apparent conundrum is resolved by noticing that many links are long-range, and short- 
range links may be short in an irrelevant category for the prediction task. Indeed, a recent data- 
driven study by Tang and Liu [75] has shown that social link-based classifiers perform much better 
when edges are labeled with categories in which they are short. We conjecture that such classifiers 
would improve even further if instead of edges, the actual social distance between nodes were used. 

The ability to separate social categories also enables the automatic detection of circles of friends 
from different contexts in social networking sites. This automatic detection has been cited as one 
of the main selling points of Google+, and is at the heart of the startup Katango. In this sense, our 
work provides some theoretical underpinnings for this fast-growing facet of the social networking 
market. Separating edges by categories has the additional benefit that one can identify when edges 
are short in more than one category, which could enable the automatic detection of close friends 
[78, 79].

Another natural application is the discovery of "social communities" |10 [ I20 [ [2T | I16 [ 166]. One 
might argue that the plethora of different network community detection objectives and heuristics 
is largely a result of stating the objectives and algorithms in terms of the graph structure, when 
the goal is really to identify clusters in the metric spaces. Since the social space is rarely explicitly 
modeled or related to the network, the connection between the objective function and the actual 
desired object is absent. Explicitly reconstructing the social space would constitute the first step
toward a more sound community identification algorithm. The presence of multiple categories in 
the model will naturally give rise to overlapping communities as well. Indeed, some of the work on 
reconstructing Euclidean spaces in the statistics community [33, 45] is explicitly motivated by the
desire to identify communities, and builds community structure into a Bayesian prior. 

Social distances can also be used to predict unobserved or potential social links. Link prediction 
has been studied in [Tj 114 1 [ST j 164 1 167]. Unobserved or potential links are most likely present between 
node pairs at small distances; hence, once distances are known, missing links can be predicted easily.

2 Related work 

Our work is related to work in a large number of communities: latent space reconstruction in statis- 
tics and mathematical sociology, community discovery, small-world networks, network localization, 
and metric space embeddings. We discuss the different areas in their separate sections. 

2.1 Latent Space Reconstruction 

Several recent papers [5l [111 [291 [331 [Ml [Ml SSI [Ml [Ml ESI EH] aim to reconstruct latent metrics from
an observed social network. The precise models differ across these papers: most assume Euclidean
spaces [SI HSl [331 [Ml [Ml SSI [Ml [Ml ES], while a few consider ultrametrics to model hierarchical
communities [14, 68]. Among the papers considering Euclidean spaces, there are different assumptions
about link distributions: most assume a logit-linear model [33, 34, 45, 62, 64], while a few consider
inverse polynomial "small-world" distributions [Sll2l[M].^6 There are many other modeling dimensions
along which these papers (and ours) differ, including: variance in node degrees, additional
information about nodes (such as locations of some nodes [S]), uniform or clustered priors for node
locations, whether algorithms are supposed to be centralized or distributed [M], etc.^7

Two main differences stand out between our work and the majority of these papers (in addition 
to the more minor modeling differences). First, we model multiple categories, which is extremely 
realistic, but makes the model, algorithms, and analysis significantly more complex. Second, the 
majority of the work cited above [SI [HI [33l [Ml [Ml SSI [Ml EH] estimates the underlying space either 
using Maximum Likelihood Estimates (MLE), or by imposing a Bayesian Prior and maximizing 
the probability of the chosen locations. Both appear to be very complex problems, and indeed, 
all of the papers employ heuristics (based on Gibbs Sampling, Metropolis-Hastings, Simulated 
Annealing, etc.) without guarantees on the likelihood or probability of the solution returned. More 
fundamentally, even if it were possible to obtain the MLE or highest-probability solution, it is not 
clear that it would come with any guarantees on the worst-case (or even average) distortion; the 

^6 We remark that several recent studies [2, 5, 52] show that the frequency of friendships as a function of
(2-dimensional) geographic distance r, when corrected for non-uniform densities, appears to decrease as an inverse polynomial in r. This
gives some tentative empirical evidence in favor of "small-world" distributions. 

^7 Much of the recent work in the mathematical sociology community has focused on exponential random graph
models, which in a sense "hard-wire" desired distributions of certain features. These models are generally of a very 
different nature from latent-space models. A recent paper by Butts [12] combines features of both location-based 
and exponential random graph models; like the other papers listed above, it is not clear whether inference of model 
parameters would be tractable, and whether it would lead to any guarantees on distortion. 
objective function does not explicitly model distortion, and in particular may be sacrificing the
distortion of some edges in order to optimize the more global objective. 

Two notable exceptions to the MLE/Bayesian approach are the works of Fraigniaud, Lebhar, 
and Lotker [29] and Sarkar, Chakrabarti, and Moore [64]. Fraigniaud et al. [29] aim to reconstruct
a single-category small-world model in order to use the distance estimates for greedy routing. They 
propose a heuristic based on an MLE intuition; interestingly, this heuristic leads to essentially 
counting common neighbors. Their algorithm may retain a small number of long-range edges, and 
hence does not come with provable guarantees on the distortion of the reconstructed metric space. 
They prove that this does not stand in the way of greedy routing: despite the lack of distortion 
guarantees, the distances they construct provably enable greedy routing along poly-logarithmic 
length paths. 

Sarkar et al. [64] begin from the goal of explaining why simple heuristics for link prediction,
such as counting common neighbors, are successful. They show that such heuristics can be under- 
stood as identifying close pairs of nodes in a latent Euclidean space, and use this insight to give 
provable guarantees on the performance of several heuristics for link prediction. (They also suggest 
additional heuristics). In the process, they show how a metric space is implicitly reconstructed 
by counting common neighbors. There are a few key differences between their work and ours. 
First, their distributions are logit-linear, implying that long-range edges are extremely unlikely. 
The reconstruction task is still non-trivial, but they do not have to deal with any very long-range 
edges, of which our model will have many. Second, they only consider a single category; for us, 
the single-category pruning step is a departure point for the more complex stages of separating the 
different categories, and using long-range links to improve the distortion. 

2.2 Overlapping Communities 

There are conceptual similarities between our work and concurrent and independent work by Arora 
et al. [3] and Balcan et al. [6]. Their goal is more specifically to reconstruct overlapping community 
structure in graphs; similar to our approach, they also posit that the social network is a noisy signal 
about some true underlying social structure, and communities are defined with respect to those 
structures. Recall that the goal of properly identifying communities is also one of the motivations 
for our work, although we do not explicitly pursue the question of reconstructing communities with 
provable guarantees. 

The major difference between our work and that of [3, 6] is that both Arora et al. and Balcan et
al. assume a set-based latent structure (each community is modeled as a set), whereas we assume 
a latent structure based on a near-uniform-density metric (each social category is modeled as 
a separate metric space). This difference, in turn, leads to different random graph models and 
algorithmic ideas. In principle, the set-based structures could be modeled as 0-1 metrics (and 
thus fit into our framework); however, such metrics would dramatically violate our uniform density 
assumption, so that our algorithms are not applicable. 

Nonetheless, some conceptual similarities between our work and [3, 6] are worth noting. First,
a crucial aspect of all three papers is the ability to deal with overlapping latent structures: multiple
social categories in the present paper, and multiple communities for [3, 6]. All three papers need
some notion of "gap assumption" that limits overlaps in order to handle such structures. Second, a
high-level idea present in all three papers is to start with a "seed" and then "grow" it to find the
respective latent structures. While the high-level algorithmic ideas are similar, the details differ
significantly between our Amoeba algorithm and the algorithms in [3, 6]. The Amoeba algorithm
grows the "Amoeba" gradually, using short disjoint paths, whereas [3, 6] use ideas related
to finding hidden cliques. In addition, the goal of reconstructing metrics motivates substantial
algorithmic extensions (discussed in Sections 5-7) related to improving the distortion and dealing
with small node degrees. These algorithmic questions have no direct analogue in the setting of 
reconstructing communities. 

2.3 Network Localization, Embeddings, and Distance Oracles 

Reconstructing (low-dimensional, Euclidean) node distances from distance measurements has been 
studied both theoretically and practically from a wide variety of angles. In network localization for 
mobile and sensor networks (e.g., [H l72l[8T]), and network embedding for peer-to-peer networks and
the Internet (e.g., |60^ 115^ \8U [ I43j), distances are known fairly accurately, but typically only to a
few "beacon" nodes. The challenge is to choose beacons, and combine measurements, to estimate 
pairwise distance. In our setting, the presence or absence of edges provides much less reliable 
estimates of distances. However, once we succeed in obtaining basic distance measurements, the 
techniques from network embedding/localization can lead to further improvements in the estimates 
without a blowup in the running time, as shown in Section 7.

We measure the quality of our inferred metrics in terms of the distortion of the estimates. 
Distortion is commonly used as a measure of quality in metric embeddings and distance oracles 
(see, respectively, [35] and [82] for surveys). In those domains, distances are known precisely, and 
the challenge is typically to find a compact and faithful representation, for instance in terms of low 
dimensionality of the target metric or small space of the oracle. In our setting, the true distances 
(in each category) are not explicitly known, and the estimates are very noisy. Similar to metric 
space embeddings, our goal is to extract a faithful representation of each category. However, a 
second fundamental difference is that the space we "embed" in consists of multiple metrics, and 
thus severely violates the triangle inequality. 

Our focus on near-uniform density metrics is motivated by similar notions of low dimensionality 
in metric embedding, nearest neighbor search, and a number of other problems, e.g. [361 [32l [HI [7H 
I43j . In particular, near-uniform density has been used along with various modeling assumptions in 
[371 [13]. 

2.4 Small-World Networks

A long line of empirical studies confirms that many social ties and interactions correlate strongly 
with social distance, and particularly geographical distance (see, e.g., [561 [59] for a discussion). For
example, Butts [TT] gives calculations showing that geographical information alone could reduce the
entropy in network prediction by roughly 90% under moderate assumptions. More specifically, several
recent studies [21 \5\ [52] show that the frequency of friendships as a function of (2-dimensional)
geographic distance, when corrected for non-uniform densities, appears to decrease as $\Theta(r^{-2})$.

Small-world models aim to capture the natural tradeoff between a preference for shorter links 
and the randomness observed in the presence of long-range links. Initial models were due to Watts 
and Strogatz [7^ and Kleinberg [40, 41]. One of the main goals in these papers was to explain
why greedy routing, based only on the position of one's neighbors in the metric space, can
discover paths of polylogarithmic length. Since the publication of [40, 41], a large number of papers
in the theoretical computer science community have expanded the models and results in various
ways [71[ig[Ml[H[25l[271[26l[Ml[3Ql[ll[l9l[l8l[Ml[65. The main focus in the community has
continued to be the ability of small-world networks to route greedily and efficiently. In particular,
the goal has been to find ways to augment graphs with suitable long-range links or (semi-)metrics,
provide nodes with additional knowledge or let them perform some local graph exploration, or
exploit non-uniformity in node degrees, all in an effort to achieve routing along paths of essentially
optimal length. Several good recent surveys summarize the work along these lines [22 \ 142 1 150].

3 Definitions and Preliminaries 

We define a formal model for the latent social space that gives rise to observed social networks. 
In general, it will not be a metric space: it naturally possesses multiple social dimensions, and 
proximity in just one of those dimensions (e.g., geography or occupation) usually means that 
individuals are "close." 

First, we define a basic model of a single social metric space. We then discuss how to extend the 
concept to multiple metrics; in particular, we formalize a notion of metric spaces being sufficiently 
"independent." 

We begin with some formalities. Throughout, $V$ is a ground set of $n$ nodes. For a metric
$\mathcal{D}$, we use the standard notion of balls, i.e., $B(u,r) = \{v \mid \mathcal{D}(u,v) \le r\}$. We liberally use $O(\cdot)$
notation to simplify the presentation. In theorem statements, the constants in $O(\cdot)$ can depend on
the constants in our setting. Elsewhere, the constants in $O(\cdot)$ are absolute, unless noted otherwise.

Most of our results are with high probability, with respect to the randomness in the graph
generation process. By this, we mean that the success probabilities are $1 - n^{-c}$, where the constant
$c > 1$ is large enough to allow all needed applications of the Union Bound (over polynomially many
events). By a slight abuse of notation, we will write with high probability for probability $1 - n^{-c}$,
without explicitly specifying the constant $c > 1$.

3.1 A model for one social category 

A single category of the latent space is modeled essentially as a $d$-dimensional Euclidean space.
More precisely, $V$ is a subset of the $d$-dimensional torus;* that is, the nodes lie in $[0, R]^d$ for some
$R$, and the distance between points $x, y \in [0, R]^d$ is
$\mathcal{D}(x, y) = \left(\sum_i \left(\min(|x_i - y_i|, R - |x_i - y_i|)\right)^2\right)^{1/2}$.
We require that the node density be nearly uniform, in the following sense: any unit cube in the
torus contains at least one and at most $C_{ud}$ nodes, for some known constant $C_{ud} \ge 1$. (Since $C_{ud}$
will always be a constant, we will sometimes hide $C_{ud}$ factors in $O(\cdot)$ notation.) For some of our
results, we also want to use the actual lattice structure as a reference: We refer to the graph of
integer points from $[0, R]^d$ with edges between all pairs at distance $\mathcal{D}(x, y) \le 1$ as the toroidal grid.

If nodes $u, v$ are at distance $r = \mathcal{D}(u,v)$, then the edge $(u,v)$ is present independently of other
edges, with probability $f(r) = \min(1, C_{sg} k_{sg} r^{-d})$. Here, $C_{sg} = \Theta(1/\log n)$ is a normalization constant
chosen to ensure that the expected average node degree is 1 whenever $k_{sg} = 1$. Then, $k_{sg}$ is a
parameter controlling the expected average node degree. When $C_{sg} k_{sg} \le 1$, the expected average
degree is exactly $k_{sg}$; otherwise, the dependence of the node degree on $k_{sg}$ is sublinear and strictly
monotone. We call $k_{sg}$ the target degree, even though strictly speaking, it does not equal the
average degree. Following the literature (e.g., [40, 41]), we focus on the cases $k_{sg} = O(1)$ and
$k_{sg} = \mathrm{polylog}(n)$. We use $E_{sg}$ to denote the edge set obtained from this distribution, and $\mathcal{G}(V, \mathcal{D})$
for the random graph model, which we call the single-category social graph.

*Prior work deals with a $d$-dimensional grid, which is somewhat undesirable, as there is an asymmetry between
the nodes on the border and on the inside, which gets more pronounced in higher dimensions.

When $k_{sg} \ge 1/C_{sg}$, all edges of length at most 1 are present in $E_{sg}$ with probability 1. Otherwise,
even to ensure connectivity of the social graph, one must insert a suitable "local edge set" separately.
(For instance, much of the literature on small-world networks assumes that the $d$-dimensional grid
is always part of the graph.) This issue is discussed in more detail in Section 7 in the context of
low node degrees.

Our main result easily extends to a more general model in which, for a suitably large $R =
\mathrm{polylog}(n)$, an edge $(u,v)$ of length $r = \mathcal{D}(u,v)$ is present with probability at least $f(r)$ for all
$r \le R$, and with probability smaller than $f(r)$ for all $r > R$. We omit this generalization for ease
of presentation.
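The single-category model of this subsection can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the brute-force loop over all node pairs, and the convention that $r = 0$ receives probability 1 are assumptions made here, and the values `c_sg` and `k_sg` are supplied by the caller instead of being derived from $n$.

```python
import itertools
import math
import random


def torus_distance(x, y, R):
    """Euclidean distance on the torus [0, R]^d, wrapping around in each coordinate."""
    return math.sqrt(sum(min(abs(a - b), R - abs(a - b)) ** 2 for a, b in zip(x, y)))


def edge_probability(r, c_sg, k_sg, d):
    """f(r) = min(1, C_sg * k_sg * r^(-d)); r = 0 is treated as probability 1 (assumption)."""
    if r == 0:
        return 1.0
    return min(1.0, c_sg * k_sg * r ** (-d))


def sample_single_category_graph(points, R, c_sg, k_sg, rng):
    """Include each node pair independently with probability f(distance)."""
    d = len(points[0])
    edges = set()
    for u, v in itertools.combinations(range(len(points)), 2):
        r = torus_distance(points[u], points[v], R)
        if rng.random() < edge_probability(r, c_sg, k_sg, d):
            edges.add((u, v))
    return edges
```

For example, `sample_single_category_graph(points, R, c_sg, k_sg, random.Random(0))` returns a set of index pairs `(u, v)` with `u < v`. The quadratic loop is only meant for small instances.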

3.2 Multiple social categories 

When multiple social categories give rise to edges independently (such as work-related, geography-related,
and hobby-related friends), we model the observed social network as the union of the
graphs generated by the individual categories. Formally, each social category is a single-category
social graph $\mathcal{G}_i = \mathcal{G}(V, \mathcal{D}_i)$ with near-uniform density for $i = 1, \ldots, K$, and the edge sets of the $\mathcal{G}_i$
are mutually independent. $K$ is a (small) constant. Balls with respect to the category-$i$ metric are
denoted by $B_i(u,r)$. A multi-category social graph is obtained by taking the union of all edges,
i.e., $E_{sg} = \bigcup_{i=1}^{K} E_{sg}^{(i)}$. Taking the union is analogous to defining the social distance as the minimum
over the categories; in particular, the social space thus defined is not a metric.
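To see concretely why the minimum over categories is not a metric, consider the following small sketch. The three nodes and their one-dimensional positions are hypothetical, chosen only for illustration, and the helper names are not from the paper.

```python
def social_distance(u, v, positions_per_category):
    """Minimum over categories of the (here one-dimensional) distance between u and v."""
    return min(abs(pos[u] - pos[v]) for pos in positions_per_category)


def union_of_edges(edge_sets):
    """Observed edge set: the union of the per-category edge sets."""
    observed = set()
    for edges in edge_sets:
        observed |= set(edges)
    return observed


# Hypothetical positions of nodes a, b, c in two categories.
category_1 = {"a": 0, "b": 1, "c": 10}
category_2 = {"a": 10, "b": 0, "c": 1}
categories = [category_1, category_2]

d_ab = social_distance("a", "b", categories)  # close in category 1
d_bc = social_distance("b", "c", categories)  # close in category 2
d_ac = social_distance("a", "c", categories)  # far in both categories
```

Here `a` and `b` are close in one category and `b` and `c` in the other, yet `a` and `c` are far apart in both, so `d_ac` exceeds `d_ab + d_bc` and the triangle inequality fails.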

The different categories may have different parameters, such as the target degree or number 
of dimensions. If the target degrees are vastly different, then one category could be completely 
"drowned out" by other, denser, categories, which would make it impossible to observe its structure. 
Therefore, we assume that the target degrees $k_{sg}^{(i)}$ of the categories are within a known constant
factor of one another. We define the target degree of the multi-category social graph as the average

$$k_{sg} = \frac{1}{K} \sum_{i=1}^{K} k_{sg}^{(i)}. \qquad (1)$$

3.3 Local Disjointness of Categories

In order to be able to distinguish the edges arising from different categories, it is necessary that
the underlying metrics of different categories be sufficiently different. We capture this intuition by
requiring that any pair of small balls in two different categories be sufficiently different: formally,
the Local Category-Disjointness condition states that for any two balls $B_i(u, r)$, $B_{i'}(u', r')$ in distinct
categories $i \ne i'$, with $r, r' = O(\mathrm{polylog}(n))$,

$$|B_i(u, r) \cap B_{i'}(u', r')| \le O(\log n).$$

This condition suffices for our main result; some of the extensions require a similar but stronger
local condition called Scale-R Category-Disjointness, which will be introduced in Section 6. The
Local Category-Disjointness condition is not overly strong; for instance, we prove (in Section 8) that
both Local Category-Disjointness and Scale-R Category-Disjointness hold with high probability
when node identifiers within each category are randomly permuted.

3.4 Input and output

Since our model has several parameters, we need to be precise about what is known to the algorithm.
Most importantly, in terms of the social network, only the union $E_{sg}$ of all social network edges is
revealed to the algorithm; the division into individual categories $E_{sg}^{(i)}$ is not given.

We assume that the algorithm knows how many embeddings it needs to construct, and into
what spaces. More formally, this means that $K$ (the number of categories), $d_i$ (the number of
dimensions), and $R_i$ (the sizes of the tori) are known to the algorithm. The average target degree
$k_{sg}$ can be estimated from the expected degree, and by Chernoff Bounds, such an estimate will
be within a factor $1 \pm O(n^{-1/2})$ of the correct value with high probability. According to the model, the
individual target degrees $k_{sg}^{(i)}$ lie within a constant factor of $k_{sg}$, and we assume that this constant
factor is also known to the algorithm. To simplify presentation we assume that the target degrees
$k_{sg}^{(i)}$ and the dimensions $d_i$ are the same for all categories $i$, and that $k_{sg}$ is known.

We also assume that the upper bound $C_{ud}$ on the number of points in any unit cube is known
to the algorithm. Knowing $C_{ud}$ and the other model parameters, the normalization constant
$C_{sg} = \Theta(1/\log n)$ can also be computed to within a constant factor.

The goal of the algorithm is to output metrics $\mathcal{D}'_i$ that approximate the original $\mathcal{D}_i$. If the
output satisfies

$$\alpha\, \mathcal{D}_i(u,v) \le \mathcal{D}'_i(u,v) \le \delta\, \mathcal{D}_i(u,v) + \Delta$$

for all node pairs $u, v$, then we say that $\mathcal{D}'_i$ estimates $\mathcal{D}_i$ with contraction $\alpha$, expansion $\delta$ and
additive error $\Delta$. The multiplicative distortion of $\mathcal{D}'_i$ is then $\delta/\alpha$. If we mention no multiplicative
distortion (or contraction), then we implicitly refer to the case of distortion (contraction) 1. We do
not require that $\mathcal{D}'_i$ itself be a $d_i$-dimensional Euclidean metric, only that it approximate $\mathcal{D}_i$ with
low distortion.
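The definition above translates directly into a check over all node pairs. The sketch below is illustrative only; the function name and the representation of metrics as Python callables are assumptions, not part of the paper.

```python
from itertools import combinations


def estimates_with(true_dist, est_dist, nodes, alpha, delta, additive):
    """Return True if alpha*D(u,v) <= D'(u,v) <= delta*D(u,v) + additive for all pairs."""
    for u, v in combinations(nodes, 2):
        d = true_dist(u, v)
        e = est_dist(u, v)
        if not (alpha * d <= e <= delta * d + additive):
            return False
    return True
```

With `alpha = 1` this checks the "no contraction" guarantee used in the main result, and `delta / alpha` is the multiplicative distortion.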

3.5 Chernoff bounds 

In many places, we bound tail deviations using standard Chernoff Bounds. Specifically, we use the 
following version, which can be found, e.g., in [TTl pages 6-8]. 

Theorem 3.1 (Chernoff Bounds). Let $X$ be the sum of independent random variables distributed
in $[0, 1]$, and let $\mu' \ge \mu = E[X]$. Then the following hold:

$$\mathrm{Prob}[|X - \mu| > \delta\mu] < \exp(-\mu\delta^2/3), \quad \text{for any } \delta > 0, \qquad (2)$$
$$\mathrm{Prob}[X > (1 + \delta)\mu'] < \exp(-\mu'\delta^2/3), \quad \text{for any } \delta \in (0,1). \qquad (3)$$
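As a numerical sanity check of the upper-tail bound (3), one can compare it with the exact tail of a binomial distribution, which is a sum of independent variables in $[0, 1]$. The parameters below ($n = 100$, $p = 0.5$, $\delta = 0.2$) are arbitrary choices made for this illustration.

```python
import math


def binomial_upper_tail(n, p, t):
    """Exact Prob[X >= t] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(t, n + 1))


def chernoff_upper_bound(mu_prime, delta):
    """Right-hand side of bound (3): exp(-mu' * delta^2 / 3)."""
    return math.exp(-mu_prime * delta ** 2 / 3)


n, p, delta = 100, 0.5, 0.2
mu = n * p
threshold = math.ceil((1 + delta) * mu)
exact = binomial_upper_tail(n, p, threshold)
bound = chernoff_upper_bound(mu, delta)
```

The exact tail is far below the bound here. The bound is loose for a single event but decays exponentially in $\mu'$, which is what makes the union bounds over polynomially many events work once expectations are at least logarithmic.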

The bounds in Theorem 3.1 sometimes apply (and are useful) even when the summands are
not independent. In particular, our analysis of Local Category-Disjointness and Scale-R Category-Disjointness
in Section 8 uses one such result in which the randomness arises from a random
permutation. We state and prove the corresponding version of Chernoff Bounds in that section.

4 The main result

In this section, we present our main result, an algorithm for distance reconstruction for multiple
categories with constant distortion.

Theorem 4.1. Consider a multi-category social graph with $C_{sg} k_{sg} = \Omega(\log n)$, near-uniform density
and Local Category-Disjointness. There is an algorithm that with high probability reconstructs
distances in each category with constant expansion, no contraction, and $\mathrm{polylog}(n)$ additive error.
Moreover, such distance estimates (as spanner graphs or as distance labels) can be computed in
time $n\,\mathrm{polylog}(n)$.

4.1 Overview and intuition 

We begin with a high-level overview of the algorithm and the intuition for the proof, before discussing
the different stages in detail in individual subsections. Recall that the algorithm's input
is the set $E_{sg} = \bigcup_i E_{sg}^{(i)}$ of edges from all categories. For the entire section, we assume that the
average node degree is high enough: $C_{sg} k_{sg} = \Omega(16^d K^3 \log n)$ for a sufficiently large constant in $\Omega(\cdot)$.
Let $r_{loc} = \Theta((C_{sg} k_{sg})^{1/d})$ be the local radius: by definition of the generative model, all edges
between node pairs $(u,v)$ at distance $\mathcal{D}(u,v) \le r_{loc}$ are in $E_{sg}$ with probability 1. We define the
pruning radius to be $r_{pru} = \Theta(r_{loc} K^{2/d})$.

The algorithm proceeds in multiple stages. Each of these stages makes use of the (random) 
long-range edges. To avoid stochastic dependencies between the stages, we can randomly partition 
the edges of Esg into a constant number of sets. Each stage then makes use of its own set. Since 
the nodes' degrees are high enough, this does not affect the high-probability guarantees. For ease 
of notation, we will not explicitly talk about the partitions for the remainder of this section. All 
results in this section hold with high probability. 

In the first stage, called the Two-Hop Test, the algorithm produces a pruned set $E_{pru}$ (which
need not be a subset of $E_{sg}$), with the following guarantee for all node pairs $(u,u')$:

• If $u, u'$ are at distance at most $r_{loc}$ in (at least) one category $i$, then $(u,u') \in E_{pru}$.

• If $u, u'$ are at distance at least $r_{pru}$ in all categories $i$, then $(u,u') \notin E_{pru}$.

Thus, the guarantee is that all short edges are present, and all sufficiently long edges are absent. 
The algorithm makes no guarantees for node pairs in the intermediate distance range. 

To achieve this pruning, the Two-Hop Test counts the number of 2-hop paths (common neigh- 
bors) between {u,u'), and compares it to a carefully chosen threshold. Similar to what Sarkar 
et al. [Mj showed for the single-category case and the logit-linear edge probabilities, our analysis 
shows that this simple heuristic can provide provable distortion guarantees under the small-world 
model, even in the more difficult case of multiple categories. 

In the second stage, called Amoeba stage, the algorithm covers $E_{pru}$ with individual edge sets
$E_{amb}^{(i)}$ (which need not be disjoint); the set $E_{amb}^{(i)}$ corresponds to category $i$. The key property we
prove is that whenever $u, v$ are at distance at most $r_{loc}$ in category $i$, then $(u,v) \in E_{amb}^{(i)}$, whereas
$(u,v) \notin E_{amb}^{(i)}$ whenever $u$ and $v$ are at distance at least $r_{amb} = \Theta(r_{pru} K^{3/d}) = \Theta(r_{loc} K^{5/d})$. Again,
for the intermediate range, the algorithm makes no guarantees about the presence or absence of
edges. This guarantee implies that the shortest-path metric of $E_{amb}^{(i)}$ gives an embedding of $\mathcal{D}_i$
with constant multiplicative distortion $O(K^{5/d})$ for all node pairs at distance at least $r_{loc}$, and
poly-logarithmic additive distortion for all node pairs at distance at most $r_{loc}$.

The algorithm constructs the edge sets $E_{amb}^{(i)}$ one at a time. For each $i$, it begins by finding a
poly-logarithmically large clique in $E_{pru}$ that is sufficiently spread out in all previously constructed
$E_{amb}^{(j)}$. (We show using the Local Category-Disjointness condition that the node set of this clique
will have diameter at most $4 r_{pru}$ in some category $i$.) Starting from this clique, as long as possible,
it adds edges $(u, v)$ that are "supported" by enough edges (in $E_{sg}$) between $v$'s neighborhood in
$E_{amb}^{(i)}$ and $u$. The key part of our analysis is to show that this process will indeed add all sufficiently
short edges (and in particular end up having added all nodes), while excluding all edges that are
long in category $i$.

Throughout this section, we frequently count the number of edges in $E_{sg}$ between two node sets
(one of which may be a single node). We usually calculate the expectation, and then invoke Chernoff 
Bounds to guarantee that the number of edges is within the desired range. The expectation or 
desired number of edges will be (at least) logarithmic, allowing the application of Chernoff Bounds. 

4.2 Pruning stage: the Two-Hop Test 

For a node pair $u, v$, let $M_\wedge(u,v)$ be the number of two-hop $u$-$v$ paths in $E_{sg}$, i.e., the number of
common neighbors of $u$ and $v$ in $E_{sg}$. The Two-Hop Test is as follows:

for each node pair $(u,u')$, accept if $M_\wedge(u,u') \ge M_\wedge$, reject otherwise. (4)

We define the threshold as $M_\wedge = \Theta(k_{sg} C_{sg})$, where the constant in $\Theta(\cdot)$ can be calculated explicitly
from the known parameters. Henceforth, let $E_{pru}$ be the set of all accepted node pairs.
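The Two-Hop Test (4) amounts to thresholding common-neighbor counts. The following sketch is a naive quadratic illustration under stated assumptions: the function name is hypothetical, the threshold is passed in as a plain number instead of being computed from the model parameters, and no attempt is made to match the near-linear running time claimed in Theorem 4.1.

```python
from itertools import combinations


def two_hop_prune(nodes, edges, threshold):
    """Accept every node pair whose number of common neighbors is at least the threshold."""
    neighbors = {u: set() for u in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    accepted = set()
    for u, v in combinations(sorted(nodes), 2):
        if len(neighbors[u] & neighbors[v]) >= threshold:
            accepted.add((u, v))
    return accepted


# A 4-cycle 0-2-1-3-0 plus the chord (2, 3): nodes 0 and 1 are not adjacent,
# but they share the two neighbors 2 and 3.
example_edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
example_pruned = two_hop_prune([0, 1, 2, 3], example_edges, threshold=2)
```

In this toy example the non-edge `(0, 1)` is accepted while the edge `(0, 2)` is rejected, which illustrates that the accepted set is neither a subset nor a superset of the observed edges.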

Lemma 4.2. With high probability, the Two-Hop Test accepts all node pairs of distance at most
$r_{loc}$ in some category, and rejects all node pairs whose distance is at least $r_{pru}$ in all categories.

Proof. The proof is based on a careful decomposition of the metric space into intersections of rings 
around u and u' , allowing a sufficiently accurate estimate of the number of their common neighbors. 

We begin by proving the positive (acceptance) part. If $u, u'$ are at distance $\mathcal{D}_i(u,u') \le r_{loc}$,
then they are close enough such that the balls $B_i(u, r_{loc})$ and $B_i(u', r_{loc})$ overlap in a
(dimension-dependent) constant fraction of their nodes. Counting the size of this overlap, and using that
$r_{loc} = \Theta((k_{sg} C_{sg})^{1/d})$, we get that

$$|B_i(u, r_{loc}) \cap B_i(u', r_{loc})| \ge \Omega(2^{-d} |B_i(u, r_{loc})|) \ge \Omega(2^{-d}\, \Theta((k_{sg} C_{sg})^{1/d})^d) \ge \Omega(k_{sg} C_{sg}),$$

for a sufficiently large constant in the definition of $r_{loc}$. In the original model, each edge between
$u$ or $u'$ and a node in $B_i(u, r_{loc}) \cap B_i(u', r_{loc})$ is present with probability 1. Even if the edge set is
randomly partitioned into a constant number of edge sets for the different stages of the algorithm,
both $u$ and $u'$ will have edges to each node in $B_i(u, r_{loc}) \cap B_i(u', r_{loc})$ independently with constant
probability. An application of the Chernoff Bound therefore guarantees that $M_\wedge(u,u') \ge \Omega(k_{sg} C_{sg})$
with high probability, and $M_\wedge = \Omega(k_{sg} C_{sg})$ for a suitably chosen constant.

For the second part of the lemma (rejection), fix two nodes $u, u'$ such that $\mathcal{D}_i(u,u') \ge r_{pru}$
for all categories $i$. Consider two categories $i, i'$ ($i = i'$ is possible), and define $S_{i,i'}$ to be the set
of all nodes $v$ such that $(u, v) \in E_{sg}^{(i)}$ and $(u', v) \in E_{sg}^{(i')}$. We prove a high-probability bound of
$O(C_{sg} k_{sg}/K^2)$ on $|S_{i,i'}|$ for a suitably small (absolute) constant in the $O(\cdot)$. A union bound over
all pairs $i, i'$ then implies the claim.

We define a sequence of concentric rings of exponentially increasing radius around $u$, as follows:

$$R_0 = B_i(u, r_{pru}/2),$$
$$R_j = \{v \mid \mathcal{D}_i(u,v) \in (2^{(j-1)/d} \cdot r_{pru}/2,\; 2^{j/d} \cdot r_{pru}/2]\}, \quad \text{for each } j \ge 1.$$

So $R_j$ is the set of nodes at distance roughly $2^{j/d} \cdot r_{pru}/2$ from $u$ in category $i$. Likewise, we define
the concentric rings around $u'$, with respect to category $i'$:

$$R'_0 = B_{i'}(u', r_{pru}/2),$$
$$R'_j = B_{i'}(u', 2^{j/d} \cdot r_{pru}/2) \setminus B_{i'}(u', 2^{(j-1)/d} \cdot r_{pru}/2), \quad \text{for each } j \ge 1.$$

The rings $\{R_j\}_{j \ge 0}$ form a disjoint cover of $V$, as do the rings $\{R'_j\}_{j \ge 0}$. To bound the size of
$S_{i,i'}$, we bound $|S_{i,i'} \cap R_j \cap R'_{j'}|$, for all $j, j' \ge 0$.
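The ring decomposition can be made concrete with a small helper that maps a distance to its ring index. This is an illustrative sketch; the function names are hypothetical, and a simple loop is used instead of a closed-form logarithm to avoid rounding issues at ring boundaries.

```python
import math


def outer_radius(j, r_pru, d):
    """Outer radius of ring R_j, namely 2^(j/d) * r_pru / 2."""
    return 2 ** (j / d) * r_pru / 2


def ring_index(distance, r_pru, d):
    """Index j of the ring R_j that contains a node at the given distance from the center."""
    j = 0
    while distance > outer_radius(j, r_pru, d):
        j += 1
    return j
```

Because the outer radius grows by a factor $2^{1/d}$ per ring, the volume of the enclosed ball doubles with each step, while the edge probability $C_{sg} k_{sg} r^{-d}$ halves. This is the tradeoff the proof exploits when summing over rings.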

First consider the case $j = j' = 0$. For $i = i'$, $R_0$ and $R'_0$ are disjoint by definition, and for
$i \ne i'$, the Local Category-Disjointness condition ensures that $|R_0 \cap R'_0| = O(\log n)$.

Next, we consider the case $j \ge j'$, $j \ge 1$. (The case $j' > j$, $j' \ge 1$ is symmetric.) We write
$r = 2^{j/d} \cdot r_{pru}/2$ and $r' = 2^{j'/d} \cdot r_{pru}/2$. By definition of the edge generation model, the probability
that $v \in R_j$ has an edge to $u$ in $E_{sg}^{(i)}$ is at most $C_{sg} k_{sg} (r/2^{1/d})^{-d} = 2 C_{sg} k_{sg} r^{-d}$, while the probability
that $v \in R'_{j'}$ has an edge to $u'$ in $E_{sg}^{(i')}$ is at most $2 C_{sg} k_{sg} (r')^{-d}$, or at most 1 if $j' = 0$. The presence
of these edges is independent of one another. Because $R_j \cap R'_{j'}$ is contained in $B_{i'}(u', r')$, it can
contain at most $C_{ud} (r')^d = O((r')^d)$ nodes.* Thus, both for the case $j' = 0$ and $j' > 0$, we obtain
that

$$E[|S_{i,i'} \cap R_j \cap R'_{j'}|] \le O\big((C_{sg} k_{sg})^2\, r^{-d} (r')^{-d} (r')^d\big)$$
$$\le O\big((C_{sg} k_{sg})^2\, (2^{j/d} \cdot r_{pru}/2)^{-d}\big)$$
$$\le O\big((C_{sg} k_{sg})^2\, 2^d\, r_{pru}^{-d}\, 2^{-j}\big).$$

We now first sum over all $j \ge j'$ (using that $\sum_{j \ge j'} 2^{-j} = O(2^{-j'})$), and then over all $j'$, to obtain
that

$$\sum_{j, j'} E[|S_{i,i'} \cap R_j \cap R'_{j'}|] \le O\big((C_{sg} k_{sg})^2\, 2^d\, r_{pru}^{-d}\big).$$

By choosing $r_{pru} = \Theta(r_{loc} K^{2/d})$ with a suitably large (absolute) constant, we can cancel out the
$2^d$ term and obtain an arbitrarily small absolute constant $\gamma$ in the $O(\cdot)$ term. Recalling that
$r_{loc} = \Theta((C_{sg} k_{sg})^{1/d})$ and adding the at most $O(\log n)$ nodes (with some absolute constant) in
$S_{i,i'} \cap R_0 \cap R'_0$, we see that

$$E[|S_{i,i'}|] \le O(\gamma\, C_{sg} k_{sg}/K^2) + O(\log n).$$

*Recall that we include $C_{ud}$ terms in $O(\cdot)$.

Applying Chernoff Bounds, we obtain that with high probability, $|S_{i,i'}| = O(\gamma\, C_{sg} k_{sg}/K^2 +
\log n)$, and a union bound over all pairs $i, i'$ now shows that with high probability we have

$$M_\wedge(u, u') = O(\gamma\, C_{sg} k_{sg} + \log n) < M_\wedge$$

(when $C_{sg} k_{sg}$ is large enough and $\gamma$ small enough), which means that $(u,u')$ will be rejected. □

For the remainder of this section, we condition on the high probability event of Lemma 4.2,
i.e., we assume that $E_{pru}$ contains all edges of length at most $r_{loc}$ (in at least one category) and no
edges whose length would exceed $r_{pru}$ in all categories.

Notice that in the single-category case ($K = 1$), the result of Lemma 4.2 by itself already gives
an expansion of $r_{pru}/r_{loc} = \Theta(1)$, no contraction, and additive error $\mathrm{polylog}(n)$. We simply estimate
$\mathcal{D}(u,v)$ by the length of the shortest $u$-$v$ path in the pruned graph, multiplied by $r_{pru}$. Lemma 4.3
analyzes the distortion for a single category, and will also be used for the multi-category case. The
lemma requires the unit-disk graph to be a good approximation of the metric space, a property
that is obvious for near-uniform density sets in $\mathbb{R}^d$.

Lemma 4.3. Let $(V, \mathcal{D})$ be a metric space. Let $G$ be a graph on $V$ that includes all node pairs at
distance at most $r$ and no node pairs at distance more than $r'$, for some $r' > r \ge 1$. Let $\mathcal{D}_G$ be the
shortest-paths metric of $G$. Let $\mathcal{D}_{UD}$ be the shortest-paths metric of the unit disk graph on $(V, \mathcal{D})$,
and assume that $\mathcal{D}_{UD}(u,v) \le c \cdot \mathcal{D}(u,v)$ for all node pairs $(u,v)$, for some constant $c$. Then

$$\mathcal{D}(u,v) \le r' \cdot \mathcal{D}_G(u,v) \le \frac{c\, r'}{r} \cdot \mathcal{D}(u,v) + r'.$$

In words, $r' \cdot \mathcal{D}_G$ reconstructs $\mathcal{D}$ with expansion $c\, r'/r$, no contraction, and additive error $r'$.

Proof. Fix a node pair $(u,v)$, and let $p$ be a shortest $u$-$v$ path in $G$. By the triangle inequality,
$\mathcal{D}(u,v)$ is a lower bound on the total metric length of $p$, which in turn is at most $r' \cdot \mathcal{D}_G(u,v)$,
because each hop in $G$ has length at most $r'$. So $\mathcal{D}(u,v) \le r' \cdot \mathcal{D}_G(u,v)$. Now, let $P$ be a shortest
$u$-$v$ path in the unit disk graph. Any two nodes on $P$ that are within $r$ hops from one another are connected
by an edge in $G$. Therefore, $G$ contains a $u$-$v$ path of at most $\lceil \mathcal{D}_{UD}(u,v)/r \rceil$ hops, which implies that
$\mathcal{D}_G(u,v) \le \lceil \mathcal{D}_{UD}(u,v)/r \rceil \le \lceil c\, \mathcal{D}(u,v)/r \rceil \le c\, \mathcal{D}(u,v)/r + 1$.
Multiplying by $r'$ yields the claimed upper bound. □
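Lemma 4.3 can be checked empirically on a toy instance. The sketch below makes several simplifying assumptions that are not in the paper: nodes sit at integer positions on a line (not a torus), so $\mathcal{D}(u,v) = |u - v|$ and the unit disk graph is a path with $c = 1$; $r$ is an integer; and the choice of which pairs in the intermediate range $(r, r']$ become edges is arbitrary.

```python
from collections import deque
from itertools import combinations


def hop_distances(n, edges, source):
    """Breadth-first search: hop distance from source to every node (None if unreachable)."""
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if dist[w] is None:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def check_lemma_on_line(n, r, r_prime, c=1):
    """Verify D <= r' * D_G <= (c r'/r) * D + r' for nodes 0..n-1 on a line."""
    edges = []
    for u, v in combinations(range(n), 2):
        gap = v - u
        # All pairs within distance r are edges; some pairs in (r, r'] are edges too.
        if gap <= r or (gap <= r_prime and u % 3 == 0):
            edges.append((u, v))
    for u in range(n):
        hops = hop_distances(n, edges, u)
        for v in range(n):
            if u == v:
                continue
            true_distance = abs(u - v)
            estimate = r_prime * hops[v]
            upper = (c * r_prime / r) * true_distance + r_prime
            if not (true_distance <= estimate <= upper):
                return False
    return True
```

A call such as `check_lemma_on_line(30, 3, 5)` scales hop distances by $r'$ and confirms both inequalities for every pair on this instance.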

4.3 Amoeba stage: mapping edges to categories 

We now define the Amoeba stage of the algorithm. The Amoeba stage consists of $K$ iterations
$i = 1, \ldots, K$: in each successive iteration $i$, a new category is identified (and re-numbered as
category $i$), and some edges in $E_{pru}$ are mapped to this category. These edges constitute the edge
set $E_{amb}^{(i)}$. Eventually, each edge $e \in E_{pru}$ is mapped to at least one category.

The Amoeba stage is summarized in Algorithm 1. Each iteration $i$ consists of an initialization
phase, in which we find a suitable clique in $E_{pru}$, and a growth phase, in which we grow $E_{amb}^{(i)}$ one
edge at a time. We think of this process as growing the amoeba.

In Algorithm 1 and the subsequent analysis thereof, we use the following notation. For a subset
$S \subseteq V$, let $\mathrm{diam}_j(S)$ be its diameter in $E_{amb}^{(j)}$. Let $\Gamma(v, E)$ denote the (1-hop) neighborhood of
node $v$ in the edge set $E$. We call the clique $C$ from iteration $i$ the seed clique for category $i$. The
condition (5) is called the Amoeba Test: more precisely, edge $(u, v)$ passes the test if and only if (5)
is satisfied.

Algorithm 1 The Amoeba algorithm.

Output. Estimated social distance $\mathcal{D}'_i$, for each category $i = 1, \ldots, K$.
Parameters. Numbers $(M_\wedge, M_{amb}, N_{amb}, r_{amb})$.

Pruning Stage. Let $M_\wedge(u,u')$ be the number of common neighbors of $u$ and $u'$ in $E_{sg}$.

$E_{pru} \leftarrow \{(u,u') \in V \times V : M_\wedge(u,u') \ge M_\wedge\}$.

Amoeba Stage. For each iteration $i = 1, \ldots, K$,

1. Initialization phase. Find any clique $C \subseteq V$ in $E_{pru}$ such that $|C| \ge N_{amb}$, and $\mathrm{diam}_j(C) \ge
\log^2(n)$ for each category $j = 1, \ldots, i - 1$. Initialize $E_{amb} = C \times C$.

2. Growth phase. While there exists an edge $(u, v) \in E_{pru} \setminus E_{amb}$ such that

$E_{sg}$ contains at least $M_{amb}$ edges between $u$ and $\Gamma(v, E_{amb})$, (5)

pick any such edge and insert it into $E_{amb}$.

3. Set $E_{amb}^{(i)} = E_{amb}$. Let $\mathcal{D}'_i$ be the shortest-paths metric of $E_{amb}^{(i)}$, multiplied by $r_{amb}$.

Notation. Recall that $\mathrm{diam}_j(S)$ is the diameter of a subset $S \subseteq V$ in $E_{amb}^{(j)}$, and $\Gamma(v, E)$ denotes
the (1-hop) neighborhood of node $v$ in the edge set $E$. Condition (5) is called the Amoeba Test.
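The growth phase of Algorithm 1 can be sketched as a fixed-point loop. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, edges are treated as unordered pairs, the Amoeba Test (5) is tried in both orientations of a candidate edge (an assumption made here), and the loop rescans all candidates after each insertion instead of using an efficient data structure.

```python
from itertools import combinations


def _adjacency(edges):
    """Build a node -> neighbor-set map from an iterable of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def amoeba_growth(seed_clique, e_pru, e_sg, m_amb):
    """Grow E_amb from the seed clique by adding pruned edges that pass the Amoeba Test."""
    sg_adj = _adjacency(e_sg)
    e_amb = {frozenset(pair) for pair in combinations(seed_clique, 2)}
    amb_adj = _adjacency(tuple(edge) for edge in e_amb)
    candidates = {frozenset(edge) for edge in e_pru} - e_amb
    progress = True
    while progress:
        progress = False
        for edge in list(candidates):
            u, v = tuple(edge)
            # Number of E_sg edges between u and the E_amb-neighborhood of v, and vice versa.
            support_uv = len(sg_adj.get(u, set()) & amb_adj.get(v, set()))
            support_vu = len(sg_adj.get(v, set()) & amb_adj.get(u, set()))
            if max(support_uv, support_vu) >= m_amb:
                e_amb.add(edge)
                amb_adj.setdefault(u, set()).add(v)
                amb_adj.setdefault(v, set()).add(u)
                candidates.remove(edge)
                progress = True
    return e_amb
```

The result is a set of `frozenset` pairs. Because the test is monotone in $E_{amb}$, the order in which passing edges are inserted does not change the final edge set.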



The Amoeba stage is parameterized by numbers $(M_{amb}, N_{amb}, r_{amb})$. We set $N_{amb} = \Theta((r_{loc}/2)^d)$
and $M_{amb} = \Theta(N_{amb}/(8^d K^2))$ for suitable constants in $\Theta(\cdot)$. We define $r_{amb} = \gamma_{amb} \cdot K^{3/d} \cdot r_{pru}$
for a sufficiently large absolute constant $\gamma_{amb}$, and call it the amoeba radius.*

4.4 Analysis of the Amoeba stage 

An edge $(u,v) \in E_{pru}$ is called $i$-long if $\mathcal{D}_i(u,v) > r_{amb}$, and $i$-short if $\mathcal{D}_i(u,v) \le r_{loc}$. An edge set
$E_{amb} \subseteq E_{pru}$ is an $i$-amoeba iff $(V, E_{amb})$ contains no $i$-long edges, and it contains a clique of at
least $N_{amb}$ nodes whose category-$i$ diameter is at most $4 r_{pru}$.

The high-level outline of the correctness proof for Amoeba is as follows. We will prove by
induction on $i$ that each edge set $E_{amb}^{(i)}$ captures (at least) all $i$-short edges (renumbering the
categories appropriately), and does not include any $i$-long edges.

The induction step requires that the algorithm be able to reconstruct another category $i$ while
there is an uncovered edge. To this end, we show that $E_{amb}$ remains an $i$-amoeba throughout the
algorithm. We break the induction step into multiple lemmas capturing the following four key
points:

• The required seed clique $C$ of size $N_{amb}$ exists in $E_{pru}$.

• All edges in the seed clique have sufficiently small length.

• No $i$-long edge passes the Amoeba Test.

• While there is an $i$-short edge not yet added to $E_{amb}$, at least one such edge passes the
Amoeba Test.

*Recall that $k_{sg} C_{sg} = \Omega(16^d K^3 \log n)$ with a sufficiently large constant. In particular, if $k_{sg} C_{sg} = \Theta(16^d K^3 \log n)$,
then the parameters are $N_{amb} = \Theta(8^d K^3 \log n)$, $M_{amb} = \Theta(K \log n)$ and $r_{amb} = \Theta((K^8 \log n)^{1/d})$.

Lemma 4.4. If there is an edge $e$ not included in any $E_{amb}^{(j)}$, then $E_{pru}$ contains a clique of at least
$N_{amb}$ nodes whose diameter in $E_{amb}^{(j)}$ is at least $\log^2(n)$ for all $j$.

Proof. Let $e \in E_{pru}$ be an edge not included in $E_{amb}^{(j)}$ for all $j < i$, and let $i$ be a category it belongs
to. For an arbitrary node $u$, consider $B = B_i(u, r_{loc}/2)$. Because $\mathcal{D}_i(v,v') \le r_{loc}$ for all $v, v' \in B$,
the set $B$ forms a clique in $E_{pru}$. Furthermore, because of the near-uniform density of category $i$,
$B$ has $\Theta((r_{loc}/2)^d) = \Omega(C_{sg} k_{sg}) = \Omega(K^3 \log n)$ nodes, for a sufficiently large constant in the $\Omega(\cdot)$.

For any $j < i$, the Local Category-Disjointness condition implies that $|B_j(u, r_{amb} \cdot
\log^2(n)) \cap B| \le O(\log n)$. Thus, there is at least one $v \in B \setminus B_j(u, r_{amb} \cdot \log^2(n))$. Because each edge
in $E_{amb}^{(j)}$ has length at most $r_{amb}$ in category $j$, this means that $\mathcal{D}'_j(u,v) > \log^2(n)$; in particular,
$B$ cannot have diameter less than $\log^2(n)$ in $E_{amb}^{(j)}$. Since this holds for all $j$, $B$ is a candidate for
seed clique $i$, and the algorithm thus guarantees progress. □

Lemma 4.5. Let $C$ be a clique in $E_{pru}$ of size $|C| \ge \Omega(K^2 \log n)$, for a sufficiently large constant
in $\Omega(\cdot)$. Then, there exists a category $i$ such that $\mathcal{D}_i(u,v) \le 4 r_{pru}$ for all $u, v \in C$.

Proof. Fix an arbitrary $w \in C$. Because each edge $(u,v) \in E_{pru}$ satisfies $\mathcal{D}_i(u,v) \le r_{pru}$ for some
category $i$, there is a category $i$ such that for at least $|C|/K$ nodes $v \in C$, we have $\mathcal{D}_i(w,v) \le r_{pru}$.
Fix such a category $i$, and let $S$ be the set of all $v \in C$ with $\mathcal{D}_i(w, v) \le r_{pru}$. If $S = C$, then we are
done.

Otherwise, consider a node $u \in C \setminus S$. For each node $v \in S$, there is a category $i'$ with
$\mathcal{D}_{i'}(u,v) \le r_{pru}$. In particular, there must be a category $i'$ such that $\mathcal{D}_{i'}(u,v) \le r_{pru}$ for at least
$|C|/K^2 \ge \Omega(\log n)$ nodes $v \in S$, with a large enough constant in $\Omega(\cdot)$. Fix such a category $i'$, and
let $S'$ be the set of nodes $v \in S$ with $\mathcal{D}_{i'}(u,v) \le r_{pru}$. Because $S' \subseteq B_i(w, r_{pru}) \cap B_{i'}(u, r_{pru})$, the
assumption $i' \ne i$ would contradict the Local Category-Disjointness condition. Hence $i' = i$, and $u$
is at distance at most $2 r_{pru}$ from $w$ in category $i$. Since this argument holds for every $u \in C \setminus S$,
we have proved that $C$ has diameter at most $4 r_{pru}$ in category $i$. □

Remark. Lemma 4.5 can be restated as saying that for any edge-coloring of a sufficiently large
clique that is consistent with the Local Category-Disjointness condition,* there is a color $i$ such
that the set of edges of color $i$ has diameter at most 4. Without the Local Category-Disjointness
condition, this statement is false in general for $K \ge 3$. For a simple counter-example, consider a
clique $C$ whose nodes are partitioned into three sets $C_1, C_2, C_3$ so that color $i \in \{1, 2, 3\}$ is assigned
to all edges with both endpoints in $C_i$ and to all edges with neither endpoint in $C_i$. Then, the
edge set corresponding to any one color $i$ is not even connected. For $K = 2$, there is a simple
combinatorial proof that does not involve Local Category-Disjointness.

Lemma 4.6. Assume that $E_{amb} \subseteq E_{pru}$ contains no $i$-long edge, and let $u, v$ be nodes with $(u, v) \in
E_{pru}$ and $\mathcal{D}_i(u,v) > r_{amb}$. Then, with high probability, $(u,v)$ does not pass the Amoeba Test.

*Reformulated in terms of edge colorings, the Local Category-Disjointness states that two balls with respect to
edges of colors $i \ne i'$, each of radius $\mathrm{polylog}(n)$, overlap in at most $O(\log n)$ nodes.

Proof. We bound the number of edges between $u$ and $\Gamma(v, E_{amb})$ in two parts: by the number of
edges between $u$ and $B_i(v, r_{pru})$, and the number of edges between $u$ and $\Gamma(v, E_{amb}) \setminus B_i(v, r_{pru})$.

First, we claim that $|\Gamma(v, E_{amb}) \setminus B_i(v, r_{pru})| \le O(K \log n)$. The reason is that any node
$w \in \Gamma(v, E_{amb}) \setminus B_i(v, r_{pru})$ must be at distance at most $r_{pru}$ from $v$ in some category $j \ne i$
(because $(v,w) \in E_{pru}$), so $w \in B_j(v, r_{pru}) \cap B_i(v, r_{amb})$. Now, the Local Category-Disjointness
condition implies that there can be at most $O(\log n)$ such nodes $w$ for any fixed $j$, and thus at most
$O(K \log n)$ total.

Next, we consider nodes $w \in B_i(v, r_{pru})$. By the Local Category-Disjointness condition for
$B_i(v, r_{pru}) \cap B_j(u, r_{amb})$, there can be at most $O(\log n)$ such nodes $w$ at distance at most $r_{amb}$ from
$u$ in category $j$, for a total of $O(K \log n)$ nodes.

All other nodes $w \in B_i(v, r_{pru})$ are at distance at least $r_{amb}$ from $u$ in all categories $j \ne i$, and at
distance at least $r_{amb} - r_{pru} \ge r_{amb}/2$ from $u$ in category $i$. Thus, the probability for the edge $(u, w)$
to exist in any one category $j$ is at most $q = O(C_{sg} k_{sg} r_{amb}^{-d}) = O(C_{sg} k_{sg}/(\gamma_{amb}^d K^3 \cdot r_{pru}^d))$. Summing
over all $w \in B_i(v, r_{pru})$ and all categories gives us at most $q K |B_i(v, r_{pru})| = O(C_{sg} k_{sg}/(\gamma_{amb}^d K^2))$
edges in expectation, and Chernoff Bounds prove concentration. Adding the at most $O(K \log n)$
edges of the first two types, and recalling that $\gamma_{amb}$ is a suitably large constant and $C_{sg} k_{sg} =
\Omega(K^3 \log n)$ with a large constant, we see that with high probability, the total number of edges
between $u$ and $\Gamma(v, E_{amb})$ is less than $M_{amb}$, so the edge $(u, v)$ does not pass the Amoeba Test. □

Lemma 4.7. Let Eamb an i-amoeba that does not include all i-short edges. Then, w.h.p., there 
exists an edge {u, v) G Epm that is accepted by the Amoeba Test. 

Proof. First notice that because the Amoeba Test only counts edges from $u$ to a neighborhood of $v$, it is monotone in the following sense: if the edge $e$ passes for some current edge set $E_{\mathrm{amb}}$, then it also passes for any $E'_{\mathrm{amb}} \supseteq E_{\mathrm{amb}}$. We will define an ordering $e_1, e_2, \ldots$ of all edges in category $i$ such that with high probability, $e_\ell$ will pass the Amoeba Test whenever $C \cup \{e_1, \ldots, e_{\ell-1}\} \subseteq E_{\mathrm{amb}}$. Thus, Amoeba, starting from $C$, can always make progress when considering the lowest-numbered edge $e_\ell$ not yet included. (Notice that this does not require the algorithm to actually know the ordering.)

Let $C$ be the clique in $(V, E_{\mathrm{amb}})$ of size at least $N_{\mathrm{amb}}$ whose existence is guaranteed by the
definition of an $i$-amoeba. $C \subseteq B_i(w, 2 r_{\mathrm{pru}})$ for some $w$, and $B_i(w, 2 r_{\mathrm{pru}})$ can be covered by
0((''pru/noc)'^) = 0{K'^) balls of radius riod^, at least one of which must therefore contain a 
sub-clique C" C C of at least A'amb/-?^^ nodes. Let vq be the center of such a ball Bi{v' ,r\oc/2). 

First, all edges between $u \in B_i(v_0, r_{\mathrm{loc}}/2)$ and $v \in C'$ will pass the Amoeba Test, because
$(u, w)$ is $i$-short for all $w \in C' \subseteq \Gamma(v, E_{\mathrm{amb}})$ (implying that the edge $(u, w)$ is in $E_{\mathrm{pru}}$), and
\C'\ > N^n.^/K^ > Mamb. 

Second, because each $v \in B_i(v_0, r_{\mathrm{loc}}/2)$ is now connected to all of $C'$ in $E_{\mathrm{amb}}$, the exact same argument applies to all node pairs $u, v \in B_i(v_0, r_{\mathrm{loc}}/2)$.

Third, we use induction on $r$, showing that once all edges in $B_i(v_0, r)$ have been included, all edges in $B_i(v_0, r + 1)$ will be included next in some order. For the base case, we use $r = r_{\mathrm{loc}}/2$. Let $u$ be any node in $B_i(v_0, r + 1) \setminus B_i(v_0, r)$, and $w$ a node "close to $u$ on the line from $v_0$ to $u$." More formally, $w$ is a node with $\mathcal{D}_i(v_0, w) \le r - r_{\mathrm{loc}}/4$ and $\mathcal{D}_i(u, w) \le r_{\mathrm{loc}}/4 + O(1)$. The existence of $w$ follows by the near-uniform density assumption.

By near-uniform density, the ball $B' = B_i(w, r_{\mathrm{loc}}/4)$ contains at least $\Omega(2^{-d} N_{\mathrm{amb}})$ nodes, and by induction hypothesis, all nodes of $B'$ are neighbors of $v$. Furthermore, $E_{\mathrm{sg}}$ contains edges between $u$ and all $w$ with constant probability, so using Chernoff Bounds, with high probability, the pair $(u, v)$ will pass the Amoeba Test for all $v \in B'$, inserting all these edges. Once all $i$-short edges between $u \in B_i(v_0, r + 1)$ and $v \in B_i(v_0, r)$ have been inserted, the $i$-short edges between the remaining pairs $u, v \in B_i(v_0, r + 1)$ will be inserted by the following argument. Node $u$ has $i$-short edges to all nodes in $B'$ (which are already in $E_{\mathrm{amb}}$), so $\mathcal{D}_i(v, w) \le 2 r_{\mathrm{loc}}$ for all $w \in B'$. Thus, each edge from $v$ to $w \in B'$ is included with probability at least $p = \Omega(C_{\mathrm{sg}} k_{\mathrm{sg}} 2^{-d} r_{\mathrm{loc}}^{-d})$, and there are at least $|B'| \ge \Omega(4^{-d} r_{\mathrm{loc}}^{d})$ such nodes, implying that the expected number of edges between $v$ and the neighborhood of $u$ is at least $\Omega(8^{-d} C_{\mathrm{sg}} k_{\mathrm{sg}})$. By Chernoff Bounds, we obtain concentration results, and because $M_{\mathrm{amb}} \le \Omega(8^{-d} C_{\mathrm{sg}} k_{\mathrm{sg}})$, the edge $(u, v)$ will be included with high probability. □

The algorithm will thus terminate with $i$-amoebas $E^{(i)}_{\mathrm{amb}}$, $i = 1, \ldots, K$. The distance $\mathcal{D}_i(u, v)$ is now estimated as the shortest-path distance between $u$ and $v$ in $E^{(i)}_{\mathrm{amb}}$, multiplied by $r_{\mathrm{amb}}$. By
Lemma 1131 this gives constant expansion ramb/Hoc = 6(i^^''"'), no contraction, and additive error 

$r_{\mathrm{amb}}$.

4.5 Efficient Implementation of the Amoeba algorithm 

We outline how to implement the Amoeba algorithm in near-linear time. The first (and perhaps 
most surprising) step is quickly finding the seed clique. Then, we need to execute each Amoeba 
step in (amortized) polylogarithmic time. The resulting algorithm computes the graph $E^{(i)}_{\mathrm{amb}}$ for each category $i$ in near-linear time. Recall that $E^{(i)}_{\mathrm{amb}}$ is a constant-distortion spanner for $\mathcal{D}_i$, in
the sense that its shortest-path metric approximates $\mathcal{D}_i$. Once we have a spanner, we can compute
succinct distance labels by adapting a hierarchical beaconing technique from prior work on distance 
labeling and routing schemes (e.g. \32\ [T3| [69l 170]). We next describe each of these steps in more 
detail. 

4.5.1 Finding the seed clique 

By suitably adjusting the threshold Ma, the Two-Hop Test can be modified to accept all node 
pairs that are within distance r[^^ = Srpru in some category, and to reject all node pairs that 
are at distance at least r'pj.^ = Q{K'^^'^ r[^^) in all categories. We run the Amoeba algorithm on 
the pruned graph $E'_{\mathrm{pru}}$ obtained by this modified Two-Hop Test. Let $r'_{\mathrm{amb}}$ be the corresponding
Amoeba radius. To produce the seed cliques for $E'_{\mathrm{pru}}$, we use the original Two-Hop Test in the way
described below.

Consider the original Two-Hop Test, and let $E_{\mathrm{pru}}$ be the corresponding pruned graph. Let $N(u)$ denote the 1-hop neighborhood of node $u$ in $E_{\mathrm{pru}}$, including $u$ itself. For a node set $S$, define $N(S)$ to be the intersection $N(S) = \bigcap_{u \in S} N(u)$. We focus on such intersections for node sets $S \subseteq N(u)$ of size $|S| = K$.

Lemma 4.8. For any node $u$ and category $i$, there exists a set $S \subseteq N(u)$ of size $K$ such that
the intersection $N(S)$ contains at least $N_{\mathrm{amb}}$ nodes, has diameter at most $3 r_{\mathrm{pru}}$ in category $i$, and
diameter at least R = r^^j, log^(n) in all other categories. 

Proof. Let $B = B_i(u, r_{\mathrm{loc}}/2)$. We show that there exists a candidate set $S \subseteq B$. Recall that $B$ induces a clique in the pruned graph $E_{\mathrm{pru}}$, so for any subset $S \subseteq B$, we have $B \subseteq N(S)$. Since $B$ contains at least $N_{\mathrm{amb}}$ nodes and has diameter at least $R$ in each category $j \ne i$, $N(S)$ inherits these properties. Thus, it remains to ensure that $N(S)$ has low diameter in category $i$.
We claim that Local Category-Disjointness implies the existence of a subset $S \subseteq B$ of size $K$, such that any two nodes in $S$ are at distance at least $2 r_{\mathrm{pru}}$ in each category $j \ne i$. Consider (for the proof only) the following simple algorithm. The algorithm works with two set-valued variables, $S$ and $U$, initialized to $S = \emptyset$ and $U = B$. It runs the following loop $K$ times: pick any node $v \in U$, add this node to $S$, and remove from $U$ all balls $B_j(v, 2 r_{\mathrm{pru}})$, $j \ne i$. Clearly, the following invariant is maintained after each iteration: any two distinct nodes $v \in S$, $w \in S \cup U$ are at distance at least $2 r_{\mathrm{pru}}$ in any category $j \ne i$. Therefore, the algorithm finds the desired set $S$ unless $U$ were to become empty prematurely. This cannot happen because by Local Category-Disjointness, $B$ and any $B_j(v, 2 r_{\mathrm{pru}})$, $j \ne i$, overlap in at most $O(\log n)$ nodes, so the cardinality of $U$ decreases by at most $O(K \log n)$ in each iteration.
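
The proof-only selection loop above can be sketched in code. This is a minimal illustration, not part of the algorithm itself: the function name, the representation of the other categories as distance callables, and the `radius` parameter (standing in for $2 r_{\mathrm{pru}}$) are all assumptions of the sketch.

```python
from typing import Callable, Hashable, Iterable, List, Sequence

Dist = Callable[[Hashable, Hashable], float]


def greedy_separated_subset(ball: Iterable[Hashable],
                            other_dists: Sequence[Dist],
                            radius: float,
                            k: int) -> List[Hashable]:
    """Pick k nodes of `ball` that are pairwise farther than `radius`
    in every metric of `other_dists` (the categories j != i).

    Hypothetical helper: `radius` plays the role of 2 * r_pru.
    """
    chosen: List[Hashable] = []      # the set S, initially empty
    remaining = list(ball)           # the set U, initially B
    for _ in range(k):
        if not remaining:
            raise ValueError("U became empty before k nodes were chosen")
        v = remaining.pop(0)         # "pick any node v in U"
        chosen.append(v)
        # Remove from U every node lying in some ball B_j(v, radius), j != i.
        remaining = [w for w in remaining
                     if all(d(v, w) > radius for d in other_dists)]
    return chosen
```

In the proof, Local Category-Disjointness guarantees that each iteration removes only $O(K \log n)$ nodes, so the `ValueError` branch is never reached when $B$ is large enough; the sketch keeps the check only to make that dependence explicit.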

Now fix the subset $S$ guaranteed by the previous paragraph. Consider some node $w \in N(S)$. For any category $j \ne i$, there can be at most one node in $S$ within category-$j$ distance $r_{\mathrm{pru}}$ from $w$. (If there were two such nodes $v, v' \in S$ then $\mathcal{D}_j(v, v') < 2 r_{\mathrm{pru}}$, a contradiction.) Since $|S| = K$ and there are only $K - 1$ categories $j \ne i$, it follows that at least one node $v \in S$ is at distance more than $r_{\mathrm{pru}}$ from $w$ in each category $j \ne i$. Since the pruned graph $E_{\mathrm{pru}}$ contains the edge $(v, w)$, $v$ and $w$ must be close in some category, and we have proved that they can only be close in category $i$. Therefore $\mathcal{D}_i(v, w) \le r_{\mathrm{pru}}$. Since $S \subseteq B$, it follows that $\mathcal{D}_i(u, w) \le r_{\mathrm{pru}} + r_{\mathrm{loc}}/2$. Therefore, any two nodes in $N(S)$ are at category-$i$ distance at most $2 r_{\mathrm{pru}} + r_{\mathrm{loc}}$ from one another. □

For each iteration $i$ of the Amoeba Stage, we need to find a seed clique $C$ for $E'_{\mathrm{pru}}$ such that $|C| \ge N_{\mathrm{amb}}$ and $\mathrm{diam}_j(C) \ge$ log^(n), for each category $j < i$. By Lemma 4.8, one such clique is given by $N(S)$, for any given node $u$ and some subset $S \subseteq N(u)$ of size $K$. Therefore, we can run the original Two-Hop Test to obtain the pruned graph $E_{\mathrm{pru}}$, pick any node $u$, and iterate through all $K$-node subsets $S \subseteq N(u)$ until we find a set $S$ such that $N(S)$ is a clique in $E_{\mathrm{pru}}$. It is easy to see that this approach results in running time $n\,\mathrm{polylog}(n)$. In fact, one only needs the initial pruning step to be local to node $u$, so the list of all candidate subsets $N(S)$ can be obtained in polylog(n) time.
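
The subset enumeration just described can be sketched as follows. This is an illustrative sketch under stated assumptions: the pruned graph is given as a dictionary of adjacency sets, and the names `find_seed_clique` and `min_size` (standing in for $N_{\mathrm{amb}}$) are hypothetical. The diameter requirement in the other categories is not checked, since the algorithm has no direct access to the latent distances.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of the pruned graph


def closed_neighborhood(graph: Graph, u: Hashable) -> Set[Hashable]:
    """N(u): the 1-hop neighborhood of u, including u itself."""
    return graph[u] | {u}


def common_neighborhood(graph: Graph, subset: Iterable[Hashable]) -> Set[Hashable]:
    """N(S): the intersection of N(v) over all v in S."""
    nodes = list(subset)
    common = closed_neighborhood(graph, nodes[0])
    for v in nodes[1:]:
        common &= closed_neighborhood(graph, v)
    return common


def is_clique(graph: Graph, nodes: Iterable[Hashable]) -> bool:
    return all(b in graph[a] for a, b in combinations(list(nodes), 2))


def find_seed_clique(graph: Graph, u: Hashable, k: int,
                     min_size: int) -> Optional[Set[Hashable]]:
    """Try all k-node subsets S of N(u); return the first N(S) that is a
    clique with at least min_size nodes, or None if there is none."""
    for subset in combinations(closed_neighborhood(graph, u), k):
        candidate = common_neighborhood(graph, subset)
        if len(candidate) >= min_size and is_clique(graph, candidate):
            return candidate
    return None
```

Because $|N(u)|$ is polylogarithmic and $K$ is treated as a constant, the number of subsets tried is polylogarithmic, which is what keeps this brute-force search cheap.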

4.5.2 Efficient implementation of the Amoeba step 

To implement the Amoeba step efficiently, we use a queue which initially contains all edges. In each Amoeba step, edges are popped from the queue until one is found that satisfies Condition (5). Once an edge $(u, v)$ satisfies this condition, it is added to the amoeba, while all its adjacent edges are (re-)enqueued. Any one edge is adjacent to at most polylogarithmically many other edges, and can therefore be enqueued at most polylogarithmically many times. Thus the entire growth phase of the Amoeba algorithm is implemented in $n\,\mathrm{polylog}(n)$ running time. The following argument shows the correctness of this queue policy: if an edge $(u, v)$ is checked and does not satisfy Condition (5), then it can satisfy this condition at some later point only if another edge incident to $u$ or $v$ has been added to the Amoeba, i.e., only if $(u, v)$ is re-enqueued.
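
The queue policy can be sketched as below. This is a minimal sketch, assuming edges are represented as two-element frozensets and that the Amoeba Test is supplied as a callable `passes_test(edge, amoeba)`; both are choices made for illustration. The helper `shares_endpoint` is a toy stand-in predicate used only to exercise the queue logic, not the actual Condition (5).

```python
from collections import deque
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List, Set

Edge = FrozenSet[Hashable]  # an undirected edge {u, v}


def grow_amoeba(candidates: Iterable[Edge],
                seed: Iterable[Edge],
                passes_test: Callable[[Edge, Set[Edge]], bool]) -> Set[Edge]:
    """Grow an edge set from `seed`, re-checking an edge only after an
    adjacent edge has been added."""
    amoeba: Set[Edge] = set(seed)
    pending: List[Edge] = [e for e in candidates if e not in amoeba]
    incident: Dict[Hashable, List[Edge]] = {}
    for e in pending:
        for x in e:
            incident.setdefault(x, []).append(e)
    queue = deque(pending)
    queued = set(pending)
    while queue:
        e = queue.popleft()
        queued.discard(e)
        if e in amoeba or not passes_test(e, amoeba):
            continue  # dropped for now; re-enqueued if a neighbor is added
        amoeba.add(e)
        for x in e:
            for f in incident[x]:
                if f not in amoeba and f not in queued:
                    queue.append(f)
                    queued.add(f)
    return amoeba


def shares_endpoint(e: Edge, amoeba: Set[Edge]) -> bool:
    """Toy monotone predicate: e touches some edge already in the amoeba."""
    return any(e & f for f in amoeba)
```

The sketch relies on the same monotonicity as the text: a rejected edge can only start passing after an adjacent edge is added, so it is safe to forget it until then.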

4.5.3 From a spanner to succinct distance labels

Fix a category i. For the remainder of this section, all "balls" and "distances" refer to category i. 

We use the spanner $E_{\mathrm{amb}} = E^{(i)}_{\mathrm{amb}}$ produced by the Amoeba algorithm to produce distance labels for $\mathcal{D}_i$ of polylogarithmic size, so that for any two nodes $u, v$ the distance $\mathcal{D}_i(u, v)$ can be estimated with constant distortion from their labels alone (in polylogarithmic time).
Consider exponentially increasing distance scales $r$. For each distance scale $r$, pick $k_r$ scale-$r$ beacon nodes independently and uniformly at random; $k_r$ is chosen so that with high probability, each ball of radius $r$ contains $\Theta(\log n)$ scale-$r$ beacon nodes. For each scale-$r$ beacon $b$, run a breadth-first search in $E_{\mathrm{amb}}$ for $O(r)$ steps, to compute distance estimates between $b$ and all nodes within distance $O(r)$ from $b$. Simple accounting shows that computing the estimates for all scales and all beacons takes $n\,\mathrm{polylog}(n)$ time.

Thus, for every given node $u$, we have computed distance estimates between $u$ and some subset $S_u$ of beacons. $S_u$ includes all scale-$r$ beacons within distance $O(r)$ from $u$, for each scale $r$. Together, these distance estimates constitute $u$'s distance label. Given the distance labels of two nodes $u$ and $v$, one can reconstruct the distance estimate for the pair $(u, v)$ by picking the beacon $b \in S_u \cap S_v$ closest to node $u$, and using the distance estimate for the pair $(b, v)$ as an estimate for $(u, v)$.
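
A rough sketch of the beacon labels and the label-only lookup follows. Several details are assumptions of the sketch rather than the scheme's actual parameters: distances are raw hop counts in the spanner (the scaling by $r_{\mathrm{amb}}$ is omitted), the search radius is `reach * r`, and the number of beacons per scale is supplied by the caller through the hypothetical `beacons_per_scale` callable.

```python
import random
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]
Labels = Dict[Hashable, Dict[Hashable, int]]


def bfs_within(adj: Graph, source: Hashable, max_hops: int) -> Dict[Hashable, int]:
    """Hop distances from `source` to every node within `max_hops` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def build_labels(adj: Graph, scales: Iterable[int],
                 beacons_per_scale: Callable[[int], int],
                 reach: int = 2, seed: int = 0) -> Labels:
    """labels[u][b] = hop distance from u to beacon b, for nearby beacons."""
    rng = random.Random(seed)
    nodes = list(adj)
    labels: Labels = {u: {} for u in nodes}
    for r in scales:
        k = min(len(nodes), beacons_per_scale(r))
        for b in rng.sample(nodes, k):
            for u, d in bfs_within(adj, b, reach * r).items():
                labels[u][b] = d
    return labels


def estimate(labels: Labels, u: Hashable, v: Hashable) -> Optional[int]:
    """Use the shared beacon closest to u; return its distance to v."""
    shared = labels[u].keys() & labels[v].keys()
    if not shared:
        return None
    b = min(shared, key=lambda beacon: labels[u][beacon])
    return labels[v][b]
```

Only the two labels are consulted at query time, which is the point of the construction; the graph itself is needed only while the labels are being built.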



5 Improving the distortion for a single category 

Our first improvement is to reduce the distortion from a multiplicative constant to a factor $1 + o(1)$. In fact, under stronger assumptions on the uniformity of the metric space, we will be able to reduce the distortion to additively polylogarithmic. We first show the improvement for a single category, and discuss the necessary extensions for multiple categories in Section 6.

In trying to improve the distortion beyond a multiplicative constant, we face an immediate obstacle: as discussed in Section 3, an algorithm can estimate the normalization constant $C_{\mathrm{sg}}$ and the target degree $k_{\mathrm{sg}}$ only up to a constant factor. However, for further improvements of the distortion, more accurate estimates of $C_{\mathrm{sg}}$ and $k_{\mathrm{sg}}$ appear to be necessary. In order to side-step this technical obstacle, we define normalized distances

$$\mathcal{N}(u, v) = \frac{\mathcal{D}(u, v)}{(C_{\mathrm{sg}} k_{\mathrm{sg}})^{1/d}}, \qquad (6)$$

and we focus on $\mathcal{N}$ instead of actual distances as the quantities to be inferred.

Note that Theorem 4.1 can also be interpreted to yield an estimate $\mathcal{N}^*$ for $\mathcal{N}$ which with high probability has no contraction, constant expansion and polylog(n) additive error. In this section, we improve this bound to unit distortion with sub-linear additive error.

Theorem 5.1. Consider a single-category social graph of dimension $d$, with $C_{\mathrm{sg}} k_{\mathrm{sg}} = \Omega(\log n)$ and
near-uniform density. There is a polynomial-time algorithm that w.h.p. reconstructs each normal- 
ized distance J\f{u,v) with additive error ^M'^ log'^'-"'^^ n, where 7 = 2d+2 ■ algorithm runs in 
polynomial time. 

The high-level idea is to augment the Two-Hop Test from Section [J] with a post-processing step we call the Two-Ball Algorithm. This is a variation of the common neighbors heuristic where, instead of common neighbors, the algorithm counts 3-hop paths whose first and last hops are sufficiently short according to the initial estimates. More precisely, to estimate $\mathcal{N}(s, t)$, the algorithm counts edges between two node sets $B^*_s$ and $B^*_t$ that are small balls (centered at $s$ and $t$, respectively) with respect to the initial estimates $\mathcal{N}^*$.

The Two-Ball Algorithm proceeds as follows. The input consists of $\mathcal{N}^*$ and the original edge set $E_{\mathrm{sg}}$. For every two nodes $s$ and $t$, the normalized distance $\mathcal{N}(s, t)$ is estimated as follows. Let $B_u(\kappa; \mathcal{N}^*)$ be the set of the $\kappa$ closest nodes to node $u$ according to $\mathcal{N}^*$, breaking ties arbitrarily; note that this set is, up to tie-breaking, a ball with respect to $\mathcal{N}^*$. Consider balls $B^*_s = B_s(\kappa; \mathcal{N}^*)$ and $B^*_t = B_t(\kappa; \mathcal{N}^*)$, for some cardinality $\kappa$ to be specified later. Count the number of edges in $E_{\mathrm{sg}}$ between $B^*_s$ and $B^*_t$, and let $M_{s,t}$ be that number. The new estimate is

$$\mathcal{N}'(s, t) = \left( \kappa^2 / M_{s,t} \right)^{1/d}.$$

We take $\kappa = r_x^d$, where $r_x = x^{(d+2)/(2d+2)}$ and $x = \mathcal{N}^*(s, t)$. See Algorithm 2 for the pseudocode.

Algorithm 2 The Two-Ball Algorithm.

Inputs. Original edge set $E_{\mathrm{sg}}$ and initial estimates $\mathcal{N}^*$ from Theorem 4.1.
Output. Improved distance estimates $\mathcal{N}'$.
For each node pair $(s, t)$:

1. $B^*_s = B_s(\kappa; \mathcal{N}^*)$ and $B^*_t = B_t(\kappa; \mathcal{N}^*)$, where $\kappa = x^{d(d+2)/(2d+2)}$ and $x = \mathcal{N}^*(s, t)$.

2. $M_{s,t}$ is the number of edges in $E_{\mathrm{sg}}$ between $B^*_s$ and $B^*_t$.

3. $\mathcal{N}'(s, t) = (\kappa^2 / M_{s,t})^{1/d}$.

Notation. $B_u(\kappa; \mathcal{N}^*)$ is the set of the $\kappa$ closest nodes to $u$ according to $\mathcal{N}^*$, breaking ties arbitrarily.
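
The three steps of Algorithm 2 can be sketched for a single pair as follows. This is a minimal sketch, assuming the initial estimates are available as a callable `est(u, w)` and the edge set as a list of node pairs; rounding $\kappa$ to an integer, clamping it to the number of nodes, and returning infinity when no edge is counted are choices of the sketch, not of the paper.

```python
import math
from typing import Callable, Hashable, Iterable, Sequence, Set, Tuple


def k_closest(u: Hashable, nodes: Sequence[Hashable],
              est: Callable[[Hashable, Hashable], float], k: int) -> Set[Hashable]:
    """B_u(k; est): the k nodes closest to u under est (ties by input order)."""
    return set(sorted(nodes, key=lambda w: est(u, w))[:k])


def two_ball_estimate(s: Hashable, t: Hashable,
                      nodes: Sequence[Hashable],
                      edges: Iterable[Tuple[Hashable, Hashable]],
                      est: Callable[[Hashable, Hashable], float],
                      d: int) -> float:
    """Refined normalized-distance estimate for the pair (s, t)."""
    x = est(s, t)
    r_x = x ** ((d + 2) / (2 * d + 2))
    kappa = max(1, min(len(nodes), round(r_x ** d)))
    ball_s = k_closest(s, nodes, est, kappa)
    ball_t = k_closest(t, nodes, est, kappa)
    # Count undirected edges with one endpoint in each ball.
    m = sum(1 for a, b in edges
            if (a in ball_s and b in ball_t) or (a in ball_t and b in ball_s))
    if m == 0:
        return math.inf  # no edge observed: the inversion is undefined
    return (kappa ** 2 / m) ** (1.0 / d)
```

Sorting all nodes for every pair is wasteful but keeps the sketch short; an actual implementation would presumably precompute the neighbor orderings once per node.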



The idea is that $\mathbb{E}[M_{s,t}] \approx \kappa^2\, \mathcal{N}^{-d}(s, t)$, and our estimate inverts this relation. We pick $\kappa$ to optimize the trade-off between the "spatial uncertainty" (the pairwise distances between nodes in $B^*_s$ and $B^*_t$ are not exactly $\mathcal{N}(s, t)$) and "sampling uncertainty" (deviations of the number of edges from the expectation). The former increases with $\kappa$, while the latter decreases with $\kappa$.

Proof of Theorem 5.1. Assume that $\mathcal{N}^*$ satisfies the high-probability property that it is an estimate of $\mathcal{N}$ with constant distortion and polylog(n) additive error. Consider a node pair $(s, t)$, at normalized distance $y = \mathcal{N}(s, t)$.

Assume that $\mathcal{N}(s, t)$ is large enough to ensure that $r_y$ is larger than the polylog(n) additive error. (Otherwise, the additive error guarantee is trivially satisfied.) Then, by near-uniform density, all nodes in $B_s(\kappa; \mathcal{N}^*)$ are at normalized distance at most $c r_y$ from $s$, for some constant $c$. Likewise, all nodes in $B_t(\kappa; \mathcal{N}^*)$ are at normalized distance at most $c r_y$ from $t$. Therefore

$$\frac{\kappa^2}{(y + 2 c r_y)^d} \;\le\; \mathbb{E}[M_{s,t}] \;\le\; \frac{\kappa^2}{(y - 2 c r_y)^d}. \qquad (7)$$



We next apply Chernoff bounds to Mgf, and use the bounds that 



l+2j3 



• (1 - 6/3) > ^ (with /? 



c— ) to derive that 

y ' 



1-2/3 



(1 + 6/3) < 



and 



Prob 



{y^%cryY 



{y-ScryY 



> 1 - 1/n 



O(logn) 



Taking the union bound over all node pairs $(s, t)$, it follows that w.h.p. $|(\kappa^2 / M_{s,t})^{1/d} - y| \le O(r_y)$. □



5.1 The Recursive Two-Ball Algorithm 

Given that the Two-Ball Algorithm produces improved estimates of (normalized) distances, it seems natural to run the algorithm again, using the improved estimates as a starting point for defining the balls $B^*_s$ and $B^*_t$ more accurately. This suggests a recursive approach: to estimate $\mathcal{D}(s, t)$, the algorithm can use the previously computed estimates for smaller distance scales to define $B^*_s$ and $B^*_t$. We call the resulting algorithm (with carefully optimized distance scales) the Recursive Two-Ball Algorithm. The technical goal is to improve the additive error in Theorem 5.1.

The analysis of this algorithm is significantly more delicate and involved. In particular, in order to take advantage of the improved estimates, a stronger uniformity condition is needed on the metric: we say that the metric space has perfectly uniform density iff each ball of radius $r$ contains $C_{\mathrm{PD}}\, r^d \pm O(r^{d-1})$ points, where $C_{\mathrm{PD}}$ is a known constant. Then we can improve the additive error to polylog(n).

Theorem 5.2. Consider a single-category social graph with $C_{\mathrm{sg}} k_{\mathrm{sg}} = \Omega(\log n)$ and perfectly uniform density. Assume that the social distance is defined by the $\ell_2$ norm, with $d > 2$. Then, the Recursive Two-Ball Algorithm w.h.p. reconstructs all normalized distances with unit distortion and additive error polylog(n).

Remark. The algorithm uses a constant $c_d$ that captures, up to the first-order term, how the expected number of edges between two radius-$r$ balls depends on $r$ and the distance between centers. Specifically, in the setting of Theorem 5.2, consider two radius-$r$ balls whose centers are at distance $x > 4r$. The expected number of edges between these two balls is $(c_d r^2 / x)^d$, up to a multiplicative factor $1 + O(r^{-1})$. Here, $c_d$ is a constant that depends only on the dimension $d$ and the constant $C_{\mathrm{PD}}$ in the definition of perfectly uniform density. We assume that $c_d$ is known to the algorithm.

The restriction to the $\ell_2$ norm is essential to define $c_d$: under $\ell_p$, $p \ne 2$, the expected number of edges between the two balls significantly depends on the alignment of the $s$-$t$ line relative to the coordinate axes.

Remark. For d = 2, a similar (but slightly more complicated) algorithm and analysis yield additive 
error 

20(v^) for node pairs at normalized distance x; we omit the details. 

We next define the algorithm. Let us first set up the notation. Let $\mathcal{N}^*$ be the normalized distance estimates guaranteed by Theorem 4.1. We will compute refined estimates $\mathcal{N}'$, which are initialized to $\mathcal{N}^*$. Let $B_u(\kappa; \mathcal{N}')$ be the set of the $\kappa$ closest nodes to $u$ according to $\mathcal{N}'$, breaking ties arbitrarily.

The Recursive Two-Ball Algorithm proceeds as follows. The input consists of $\mathcal{N}^*$ and the original edge set $E_{\mathrm{sg}}$. The algorithm considers node pairs $(s, t)$ such that $\mathcal{N}^*(s, t) > \mathrm{polylog}(n)$, in order of increasing $\mathcal{N}^*$. For each such node pair, we define balls around $s$ and $t$ whose radius is roughly $\hat{r}_x$, where $x = \mathcal{N}^*(s, t)$ and $\hat{r}_x = x^{1/2 + 1/d}$. Formally, we define balls $B'_s = B_s(\kappa; \mathcal{N}')$ and $B'_t = B_t(\kappa; \mathcal{N}')$, where $\kappa = C_{\mathrm{PD}}\, \hat{r}_x^d$. Note that these balls are defined with respect to the improved estimates $\mathcal{N}'$. Let $M_{s,t}$ be the number of edges between $B'_s$ and $B'_t$. The new estimate is $\mathcal{N}'(s, t) = c_d\, \hat{r}_x^2\, M_{s,t}^{-1/d}$. The pseudocode is shown in Algorithm 3. Note that the algorithm is quite simple; the only complication is how to pick $\kappa$ as a function of $x = \mathcal{N}^*(s, t)$.
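
The recursive procedure described above can be sketched as follows. This is an illustrative sketch under several assumptions: the estimates are held in a dictionary keyed by ordered node pairs, `min_scale` stands in for the polylog(n) threshold, $\kappa$ is rounded and clamped to the number of nodes, and a pair keeps its old estimate when no edge is counted. None of these implementation details come from the paper.

```python
from typing import Callable, Dict, Hashable, Iterable, Sequence, Tuple

Pair = Tuple[Hashable, Hashable]


def recursive_two_ball(nodes: Sequence[Hashable],
                       edges: Iterable[Pair],
                       init_est: Callable[[Hashable, Hashable], float],
                       d: int, c_pd: float, c_d: float,
                       min_scale: float) -> Dict[Pair, float]:
    """Refine estimates pair by pair, in order of increasing initial estimate."""
    nodes = list(nodes)
    edges = list(edges)
    est = {(u, v): init_est(u, v) for u in nodes for v in nodes}
    order = sorted((est[(s, t)], i, j)
                   for i, s in enumerate(nodes)
                   for j, t in enumerate(nodes) if i < j)
    for x, i, j in order:
        if x <= min_scale:
            continue  # small scales keep their initial estimates
        s, t = nodes[i], nodes[j]
        r_hat = x ** (0.5 + 1.0 / d)
        kappa = max(1, min(len(nodes), round(c_pd * r_hat ** d)))
        # Balls are taken with respect to the *current* refined estimates.
        ball_s = set(sorted(nodes, key=lambda w: est[(s, w)])[:kappa])
        ball_t = set(sorted(nodes, key=lambda w: est[(t, w)])[:kappa])
        m = sum(1 for a, b in edges
                if (a in ball_s and b in ball_t) or (a in ball_t and b in ball_s))
        if m > 0:
            est[(s, t)] = est[(t, s)] = c_d * r_hat ** 2 * m ** (-1.0 / d)
    return est
```

Processing pairs in increasing order of the initial estimate is what makes the procedure recursive in effect: by the time a pair at scale $x$ is handled, the pairs at the smaller scale $\hat{r}_x$ that define its balls have already been refined.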

Algorithm 3 The Recursive Two-Ball Algorithm.

Inputs. Original edge set $E_{\mathrm{sg}}$ and initial estimates $\mathcal{N}^*$ from Theorem 4.1.
Output. Improved distance estimates $\mathcal{N}'$.

For each node pair $(s, t)$ such that $\mathcal{N}^*(s, t) > \mathrm{polylog}(n)$, in order of increasing $\mathcal{N}^*$:

1. $\kappa = C_{\mathrm{PD}}\, \hat{r}_x^d$, where $x = \mathcal{N}^*(s, t)$ and $\hat{r}_x = x^{1/2 + 1/d}$.

2. $B'_s = B_s(\kappa; \mathcal{N}')$ and $B'_t = B_t(\kappa; \mathcal{N}')$.

3. $M_{s,t}$ is the number of edges in $E_{\mathrm{sg}}$ between $B'_s$ and $B'_t$.

4. $\mathcal{N}'(s, t) = c_d\, \hat{r}_x^2\, M_{s,t}^{-1/d}$.

Notation. $B_u(\kappa; \mathcal{N}')$ is the set of the $\kappa$ closest nodes to node $u$ according to $\mathcal{N}'$, breaking ties arbitrarily.
$c_d$ is the constant from the remark after Theorem 5.2.

5.2 Proof of Theorem 5.2

The high-level idea of the analysis is as follows. Let $\alpha(x)$ be the maximum additive error for node pairs at normalized distance at most $x$. As in the Two-Ball Algorithm, the error comes from two sources: spatial uncertainty and sampling uncertainty. We show that the spatial uncertainty can contribute at most $O(\alpha(\hat{r}_x))$ to the overall additive error; interestingly, this holds for any choice of $\hat{r}_x$. We use Chernoff Bounds to bound the contribution of sampling uncertainty by $O(\alpha(\hat{r}_x))$ as well; this is where the particular exponent in $\hat{r}_x$ is used. It follows that $\alpha(x) = O(\alpha(\hat{r}_x))$. Finally, the distance estimates for a given node pair implicitly rely on recursion from distance scale $x$ to distance scale $\hat{r}_x$. Let $\rho(x)$ be the depth of this recursion: the number of steps until the distance scale goes below polylog(n). It is easy to see that $\alpha(x) = 2^{O(\rho(x))}$ and that $\rho(x) = O(\log \log n)$.

Consider two nodes $s$ and $t$ whose normalized distance is $x = \mathcal{N}(s, t)$.

Let $B_s = B_s(\kappa; \mathcal{N})$ and $B_t = B_t(\kappa; \mathcal{N})$ be the sets of the $\kappa$ closest nodes to $s$ and $t$, respectively, under the (correct) normalized distances $\mathcal{N}$.

We start with a simple lemma showing that this choice implies that the actual sets of nodes are very close between $B'_s$ and $B_s$ (and $B'_t$ and $B_t$, respectively).

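Lemma 5.3. For a sufficiently large constant $\beta$, we have that

$B_{\mathcal{N}}(s, \hat{r}_x - 2\alpha(\hat{r}_x) - \beta) \subseteq B'_s \subseteq B_{\mathcal{N}}(s, \hat{r}_x + 2\alpha(\hat{r}_x) + \beta)$,
$B_{\mathcal{N}}(t, \hat{r}_x - 2\alpha(\hat{r}_x) - \beta) \subseteq B'_t \subseteq B_{\mathcal{N}}(t, \hat{r}_x + 2\alpha(\hat{r}_x) + \beta)$.

Proof. We first prove the first inclusion. Let $v \in B_{\mathcal{N}}(s, \hat{r}_x - 2\alpha(\hat{r}_x) - \beta)$ be arbitrary. Because $\mathcal{N}(s, v) \le \hat{r}_x - 2\alpha(\hat{r}_x) - \beta$, the definition of $\alpha(\cdot)$ implies that $\mathcal{N}'(s, v) \le \mathcal{N}(s, v) + \alpha(\hat{r}_x) \le \hat{r}_x - \alpha(\hat{r}_x) - \beta$. On the other hand, $\mathcal{N}'(s, u) \ge \hat{r}_x - \alpha(\hat{r}_x) - \beta$ for all nodes $u$ such that $\mathcal{N}(s, u) \ge \hat{r}_x - \beta$. Therefore, there can be at most $C_{\mathrm{PD}} (\hat{r}_x - \beta)^d + O((\hat{r}_x - \beta)^{d-1})$ nodes $u$ with $\mathcal{N}'(s, u) < \mathcal{N}'(s, v)$. This number is less than $C_{\mathrm{PD}}\, \hat{r}_x^d = \kappa$ whenever $\beta$ is large enough.

Because $B'_s$ contains the $\kappa$ nodes closest to $s$ under $\mathcal{N}'$ (by its definition), this means that $v \in B'_s$. Since this argument holds for arbitrary $v$, we have proved the first claim. The second inclusion is proved by an analogous calculation. □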

We next show that the number of edges between $B_s$ and $B_t$ is close to the number of edges between $B'_s$ and $B'_t$. To state this claim concisely, let $\#\mathrm{edges}(S, S')$ be the number of edges in $E_{\mathrm{sg}}$ between node sets $S$ and $S'$.

Lemma 5.4. With high probability, $\left| \mathbb{E}[\#\mathrm{edges}(B'_s, B'_t)] - \mathbb{E}[\#\mathrm{edges}(B_s, B_t)] \right| = O(x \cdot \alpha(\hat{r}_x))$.
Proof. We construct a bijection $\varphi : (B'_s \cup B'_t) \to (B_s \cup B_t)$ as follows. Partition the domain and the co-domain into four disjoint regions each (using $\oplus$ to denote the disjoint union of sets):

$(B'_s \cup B'_t) = (B'_s \cap B_s) \oplus (B'_s \setminus B_s) \oplus (B'_t \cap B_t) \oplus (B'_t \setminus B_t)$,
$(B_s \cup B_t) = (B'_s \cap B_s) \oplus (B_s \setminus B'_s) \oplus (B'_t \cap B_t) \oplus (B_t \setminus B'_t)$.

The regions in each partition are indeed disjoint because $B_s \cap B_t = B'_s \cap B'_t = \emptyset$. We define $\varphi$ separately for each of the four subsets of the domain. First, any node in $(B'_s \cap B_s)$ or $(B'_t \cap B_t)$ is mapped to itself. Second, $\varphi$ is an arbitrary bijection $(B'_s \setminus B_s) \to (B_s \setminus B'_s)$ and $(B'_t \setminus B_t) \to (B_t \setminus B'_t)$. This completes the definition. For the second step, note that the respective domains and co-domains have the same size; this is because $|B_s| = |B'_s| = \kappa$, and $|B_t| = |B'_t| = \kappa$.

Nodes $v \in (B'_s \setminus B_s) \cup (B'_t \setminus B_t)$ are called perturbed nodes. By Lemma 5.3, $B'_s$ and $B'_t$ contain at most $C_{\mathrm{PD}} \cdot (2\alpha(\hat{r}_x) + \beta) \cdot \hat{r}_x^{d-1}$ perturbed nodes each.

By the perfectly uniform density assumption, at least $C_{\mathrm{PD}}\, r^d - O(r^{d-1})$ nodes have distance at most $r$ from $s$. In particular, setting $r = \hat{r}_x + \beta$ gives us that at least $\kappa$ nodes satisfy the distance bound, implying that every node $u \in B_s$ satisfies $\mathcal{N}(s, u) \le \hat{r}_x + \beta$. Furthermore, by the second inclusion of Lemma 5.3, every node $v \in B'_s$ satisfies $\mathcal{N}(s, v) \le \hat{r}_x + 2\alpha(\hat{r}_x) + \beta$. Similar bounds apply for $t$. We thus get that $\mathcal{N}(v, \varphi(v)) \le 2\hat{r}_x + 2\alpha(\hat{r}_x) + O(1) \le 3\hat{r}_x$ for all $v$, and of course $\mathcal{N}(v, \varphi(v)) = 0$ for unperturbed nodes $v$.

Now consider a pair $u \in B'_s$ and $v \in B'_t$ such that at least one of $u, v$ is perturbed. (We call such a pair a perturbed pair.) By triangle inequality, $|\mathcal{N}(v, u) - \mathcal{N}(\varphi(v), \varphi(u))| \le 6\hat{r}_x$, and the number of perturbed pairs is at most $O\big((4\alpha(\hat{r}_x) + 2\beta) \cdot \hat{r}_x^{2d-1}\big)$, by the bound on the number of perturbed nodes.

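Next, we bound how much a single perturbed pair $u \in B'_s$, $v \in B'_t$ affects the expected number of edges between the balls. Because $x + 6\hat{r}_x \ge \mathcal{N}(\varphi(u), \varphi(v)) \ge x - 6\hat{r}_x$, we get that

$$\frac{\mathcal{N}(u, v)}{\mathcal{N}(\varphi(u), \varphi(v))} \in 1 \pm O(\hat{r}_x / x).$$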



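We can now express the difference between the probabilities of the edges $(\varphi(u), \varphi(v))$ and $(u, v)$ as

$$\left| (x \pm O(\hat{r}_x))^{-d} - (x \pm 2\hat{r}_x)^{-d} \right| = x^{-d} \left| \left( 1 \pm \frac{O(\hat{r}_x)}{x} \right)^{-d} - \left( 1 \pm \frac{2\hat{r}_x}{x} \right)^{-d} \right| = x^{-d} \left| \frac{1}{1 \pm O(\hat{r}_x / x)} - \frac{1}{1 \pm 2\hat{r}_x / x} \right| = O(x^{-d} \cdot \hat{r}_x / x).$$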

In the second step, we truncated the Binomial expansion (because $\hat{r}_x / x = o(1/d)$), and the final step again used that $\hat{r}_x / x$ is small. Summing over all perturbed pairs, the total expected difference in the number of edges can be bounded from above as follows:

$$\left| \mathbb{E}[\#\mathrm{edges}(B'_s, B'_t)] - \mathbb{E}[\#\mathrm{edges}(B_s, B_t)] \right| \le O(1) \cdot x^{-d-1} \cdot \alpha(\hat{r}_x) \cdot \hat{r}_x^{2d} = O(x\, \alpha(\hat{r}_x)),$$

where the last step was obtained by substituting the definition of $\hat{r}_x$. The concentration now follows from Chernoff Bounds. □



Lemma 5.5. $\alpha(x) = O(\alpha(\hat{r}_x))$.
Proof. Consider two nodes $s$ and $t$ at normalized distance $x$. Using an analysis very similar to the one in the proof of Lemma 5.4, the expected number of edges between $B_s$ and $B_t$ is $(c_d \hat{r}_x^2 / x)^d \pm O(x) = c_d^d x^2 \pm O(x)$ (where $c_d$ is the constant from the remark after Theorem 5.2). $\hat{r}_x$ is chosen so that Chernoff Bounds ensure that w.h.p., the actual number of edges between $B_s$ and $B_t$ does not deviate from its expectation by more than $O(x \cdot \alpha(\hat{r}_x))$. Combining this number of edges with the bound from Lemma 5.4, the number of edges between $B'_s$ and $B'_t$ is $M_{s,t} = c_d^d x^2 \pm O(x \cdot \alpha(\hat{r}_x))$ with high probability. (The big-O term combines both the misestimates bounded by the Chernoff Bound and the ones from Lemma 5.4.)



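Because the algorithm estimates the distance as $c_d\, \hat{r}_x^2\, M_{s,t}^{-1/d}$, the additive distortion is at most

$$\begin{aligned}
\left| x - c_d\, \hat{r}_x^2\, M_{s,t}^{-1/d} \right|
&= x \cdot \left| 1 - \frac{c_d\, x^{2/d}}{\left( c_d^d x^2 \pm O(x\, \alpha(\hat{r}_x)) \right)^{1/d}} \right| \\
&= x \cdot \left| 1 - \left( \frac{c_d^d x}{c_d^d x \pm O(\alpha(\hat{r}_x))} \right)^{1/d} \right| \\
&= x \cdot \left| 1 - \left( 1 \pm \frac{O(\alpha(\hat{r}_x))}{c_d^d x \pm O(\alpha(\hat{r}_x))} \right)^{1/d} \right| \\
&\le x \cdot \frac{O(\alpha(\hat{r}_x))}{c_d^d x \pm O(\alpha(\hat{r}_x))} \\
&\le O(\alpha(\hat{r}_x)).
\end{aligned}$$

In the penultimate inequality, we used that $|1 - (1 \pm \delta)^{1/d}| \le \delta$ for any $\delta \in [0, 1]$, and the final inequality used that $\alpha(\hat{r}_x) = o(x)$ to simplify the denominator. □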

The distance estimates for a given node pair implicitly rely on recursion from distance scale $x$ to distance scale $\hat{r}_x$. Let $\rho(x)$ be the depth of this recursion: the number of steps until the distance scale goes below polylog(n). It is easy to see that $\alpha(x) = 2^{O(\rho(x))}$ and that $\rho(x) = O(\log \log n)$. This completes the proof of Theorem 5.2.
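
To see why the recursion depth is doubly logarithmic, one can iterate the map $x \mapsto x^{1/2 + 1/d}$ directly. The small sketch below does this; the function name and the `floor` parameter (standing in for the polylog(n) cutoff) are assumptions made for illustration.

```python
def recursion_depth(x: float, d: int, floor: float) -> int:
    """Number of steps x -> x ** (1/2 + 1/d) until x drops to `floor` or below."""
    exponent = 0.5 + 1.0 / d
    if exponent >= 1.0:
        raise ValueError("the scale only shrinks when d > 2")
    if floor <= 1.0:
        raise ValueError("floor must exceed 1, otherwise the loop never ends")
    depth = 0
    while x > floor:
        x = x ** exponent  # log x shrinks by the constant factor `exponent`
        depth += 1
    return depth
```

Each step multiplies $\log x$ by the constant $1/2 + 1/d < 1$, so starting from $x \le n$ the number of steps needed to reach a polylogarithmic scale is $O(\log \log n)$. For example, with $d = 4$ the scale $2^{16}$ falls below 16 after five steps.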



6 Improving the distortion for multiple categories 

In order to improve the estimates for multiple categories, we employ the two algorithms from Section 5. The main difference with the single-category case is that when we count the number of edges between the balls in the original multi-category social graph for some category $i$, some of these edges may come from other categories, which might affect the estimation. We would like to claim that the number of edges from other categories between the two balls is small compared to the number of edges from category $i$. Unfortunately, such a claim does not follow from the Local Category-Disjointness condition, which prompts the following stronger condition.

The stronger condition, called Scale-$R$ Category-Disjointness, states that at all scales up to $R$, categories look essentially "random" with respect to one another. More specifically, given a pair of balls $B, B'$ in some category $i$, we count the number of node pairs $(u, u')$, $u \in B$, $u' \in B'$, such that $u$ and $u'$ are close in some other category $j$:

$$\#\mathrm{pairs}_j(B, B', r) = \left| \{ (u, u') : u \in B,\ u' \in B',\ \mathcal{D}_j(u, u') \le r \} \right|. \qquad (8)$$
If the node identifiers within each category are permuted randomly, then the expected number of
such node pairs is $\Theta(r^d / n) \cdot |B|\, |B'|$, and with high probability, the deviations are bounded by:

#pairsj.(5,5',r) < 0(r'^/n) • \B\ \B'\ +0{log^n). (9) 

Scale-$R$ Category-Disjointness asserts that (9) holds "locally": at all distance scales up to $R$.

Definition 6.1. The Scale-$R$ Category-Disjointness condition states that (9) holds for any two
categories i ^ j, any two disjoint category-z balls B, B' with \B\ ■ \B'\ < i?^, and any r G (0, -R]. 
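
The quantity in Equation (8) is straightforward to compute when the category-$j$ distances are available, for instance in a synthetic experiment. The sketch below counts the close pairs and also evaluates the leading term $(r^d / n) \cdot |B|\,|B'|$ of the random-permutation baseline; the function names are hypothetical, and the constant hidden in the $\Theta(\cdot)$ is simply taken to be 1 for illustration.

```python
from typing import Callable, Collection, Hashable


def count_close_pairs(ball_a: Collection[Hashable],
                      ball_b: Collection[Hashable],
                      dist_j: Callable[[Hashable, Hashable], float],
                      r: float) -> int:
    """#pairs_j(B, B', r): pairs (u, u') with u in B, u' in B', D_j(u, u') <= r."""
    return sum(1 for u in ball_a for w in ball_b if dist_j(u, w) <= r)


def random_baseline(ball_a: Collection[Hashable],
                    ball_b: Collection[Hashable],
                    r: float, d: int, n: int) -> float:
    """Leading term (r^d / n) * |B| * |B'| with the hidden constant set to 1."""
    return (r ** d / n) * len(ball_a) * len(ball_b)
```

Comparing the two values on generated instances gives a quick sanity check of whether a particular pair of categories behaves "randomly" at a given scale.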

Remark. Equation (9) for randomly permuted categories is derived in Section 8. The expectation is relatively easy to derive, whereas the high-probability guarantee requires a more careful analysis. We obtain (a slightly weaker version of) Local Category-Disjointness as a special case if $R = \mathrm{polylog}(n)$ and $B$ is restricted to be a single node.

We will improve over the constant distortion under the condition above. We present two results: an extension of the Two-Ball Algorithm (Section 6.1) and an analysis of the Recursive Two-Ball Algorithm for multiple categories (Section 6.2).

As in the single-category case, we focus on normalized distances. For each category $i$, let $C^{(i)}_{\mathrm{sg}}$ and $k^{(i)}_{\mathrm{sg}}$ be the normalization constant and the target degree, respectively. The normalized category-$i$ distance between nodes $u, v \in V$ is $\mathcal{N}_i(u, v) = \mathcal{D}_i(u, v) / (C^{(i)}_{\mathrm{sg}} k^{(i)}_{\mathrm{sg}})^{1/d}$.



6.1 The Extended Two-Ball Algorithm 

The Scale-$R$ Category-Disjointness condition does not apply to distance scales beyond $R$, and even for $R = \infty$, the guarantee of Equation (9) is quite weak at very large scales. Accordingly, we find that the Two-Ball Algorithm becomes problematic at large distance scales. To deal with these issues, we apply the Two-Ball Algorithm only to distance scales small enough to provide strong guarantees. The improved distance estimates define edge lengths, and a post-processing step computes shortest paths with respect to these edge lengths. The resulting algorithm, called the Extended Two-Ball Algorithm, satisfies the following theorem.

Theorem 6.2. Assume the setting of Theorem \4-l\ with Scale-R^~^^^^^^^^ Category-Disjointness, 
R > polylog(ri) for a sufficiently large polylog(n). Then, the Extended Two-Ball Algorithm runs 
in polynomial time, and with high probability produces distance estimates M- with the following 
guarantee: 

For any pair (s,t) at normalized distance x = Mi{s,t), the estimate A/J(s,t) has multi- 
plicative distortion l± (min(a;, i?, i?))"'^/(2a+2) . 0(log n) , where R = ( 

Remark. The distortion in Theorem 16.21 can be interpreted as 1 it O ( ^"'^Z'-^'^'^^-' -log^n), where 
I = min(x, R) is, in some sense, the effective distance scale. 

We begin by defining the Extended Two-Ball Algorithm precisely. The input consists of the multi-category social graph and the distance estimates $\mathcal{N}^* = \mathcal{N}^*_i$ for a given category $i$, as guaranteed by Theorem 4.1. Recall that these are non-contracting estimates with constant expansion $\delta$ and polylog(n) additive error; we assume that (an upper bound on) $\delta$ is known to the algorithm. Apart from $\delta$, the algorithm is parameterized by the distance scale $R$ from Theorem 6.2.

The algorithm proceeds as follows (see Algorithm 4 for the pseudocode). It focuses on the edge set $H = \{(u, v) \mid \mathcal{N}^*(u, v) \le R\}$. For each edge $(u, v) \in H$, it applies the Two-Ball Algorithm with respect to distances $\mathcal{N}^*$ to obtain improved distance estimates $\mathcal{N}_H(u, v)$. These improved estimates are treated as edge lengths for $H$. For each node pair $(s, t)$, we distinguish two cases. If the edge $(s, t)$ is in $H$, we simply set the final estimate $\mathcal{N}'_i(s, t) = \mathcal{N}_H(s, t)$. Otherwise, the final distance estimate $\mathcal{N}'_i(s, t)$ is the length of the shortest $s$-$t$ path using the edge set

Ht = {{u,v) G H I M*{u,v) oTv = t}. (10) 

In other words, the distance is estimated by the length of the shortest path using only "sufficiently 
long" edges, except for possibly the last edge, which may be short. 

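Algorithm 4 The Extended Two-Ball Algorithm (for a given category $i$).
Inputs. Original edge set $E_{\mathrm{sg}}$ and initial estimates $\mathcal{N}^* = \mathcal{N}^*_i$ from Theorem 4.1.
Parameters. Distance scale $R$ and expansion $\delta$ of $\mathcal{N}^*$.
Output. Improved distance estimates $\mathcal{N}'_i$.

$H = \{(u, v) \mid \mathcal{N}^*(u, v) \le R\}$.

The Two-Ball Algorithm. For each node pair $(s, t) \in H$,

1. $B^*_s = B_s(\kappa; \mathcal{N}^*)$ and $B^*_t = B_t(\kappa; \mathcal{N}^*)$, where $\kappa = x^{d(d+2)/(2d+2)}$ and $x = \mathcal{N}^*(s, t)$.

2. $M_{s,t}$ is the number of edges in $E_{\mathrm{sg}}$ between $B^*_s$ and $B^*_t$.

3. $\mathcal{N}_H(s, t) = (\kappa^2 / M_{s,t})^{1/d}$.

Post-processing. For each node pair $(s, t)$:
If $(s, t) \in H$, then $\mathcal{N}'_i(s, t) = \mathcal{N}_H(s, t)$; otherwise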

1. Ht = {iu,v) G H I Af*{u,v) >§OTV = t}. 

2. $\mathcal{N}'_i(s, t)$ is the length of the shortest $s$-$t$ path in $H_t$ with respect to edge lengths $\mathcal{N}_H$.
Notation. $B_u(\kappa; \mathcal{N}^*)$ is the set of the $\kappa$ closest nodes to $u$ according to $\mathcal{N}^*$, breaking ties arbitrarily.
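
The post-processing step can be sketched with a standard Dijkstra search restricted to the edge set $H_t$. This is an illustrative sketch under stated assumptions: the refined lengths are passed as a dictionary `h_lengths` keyed by unordered pairs stored once, edges are treated as undirected, and the "sufficiently long" cutoff is left as an explicit parameter `min_len` chosen by the caller rather than fixed by the sketch.

```python
import heapq
import itertools
import math
from typing import Callable, Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def extended_estimate(s: Hashable, t: Hashable,
                      h_lengths: Dict[Pair, float],
                      init_est: Callable[[Hashable, Hashable], float],
                      min_len: float) -> float:
    """Final estimate for (s, t): the refined length if (s, t) is in H,
    otherwise the shortest s-t path over H_t (long edges, or edges at t)."""
    if (s, t) in h_lengths:
        return h_lengths[(s, t)]
    if (t, s) in h_lengths:
        return h_lengths[(t, s)]
    adj: Dict[Hashable, List[Tuple[Hashable, float]]] = {}
    for (u, v), length in h_lengths.items():
        if init_est(u, v) >= min_len or t in (u, v):
            adj.setdefault(u, []).append((v, length))
            adj.setdefault(v, []).append((u, length))
    tie = itertools.count()  # tie-breaker so nodes are never compared
    dist = {s: 0.0}
    heap = [(0.0, next(tie), s)]
    while heap:
        du, _, u = heapq.heappop(heap)
        if u == t:
            return du
        if du > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, length in adj.get(u, []):
            nd = du + length
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, next(tie), v))
    return math.inf  # t is unreachable within H_t
```

Restricting the path to long edges, apart from a possibly short final edge into $t$, keeps the number of hops small, so the per-edge errors of the Two-Ball estimates do not accumulate over many short steps.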



6.1.1 Analysis: the Two-Ball Algorithm for multiple categories 

We begin the analysis by showing that for sufficiently small distances, Scale-$R$ Category-Disjointness ensures that the basic Two-Ball Algorithm gives accurate estimates.

Lemma 6.3. Assume that the Scale-$R^{1+1/(d+1)}$ Category-Disjointness condition holds, and let $(s, t)$ be a node pair at normalized category-$i$ distance $\mathcal{N}_i(s, t) = x \le R$. Then, the Two-Ball Algorithm obtains a distance estimate $\mathcal{N}'_i(s, t)$ of $\mathcal{N}_i(s, t)$ with the following guarantee:

$$|\mathcal{N}_i(s, t) - \mathcal{N}'_i(s, t)| \le x^{(d+2)/(2d+2)} \cdot O(\log^2 n).$$

Proof. Recall from the proof of Theorem 5.1 that to estimate $\mathcal{N}_i(s,t)$, the Two-Ball Algorithm
considers two balls $B^*_s$, $B^*_t$ around $s$ and $t$, respectively, and counts edges between them. The
balls were chosen so that $|B^*_s| = |B^*_t| = k = r_x^d$, where $r_x = x^{(d+2)/(2d+2)}$. The improved distance
estimate is $\mathcal{N}'(s,t) = (k^2/M_{s,t})^{1/d}$, where $M_{s,t}$ is the number of edges between $B^*_s$ and $B^*_t$.
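The counting step just recalled can be illustrated with a short Python sketch. It is a hedged example rather than the paper's code: the names `ball` and `two_ball_estimate`, the callable `n_star`, the rounding of $k$ to an integer, and the tie-breaking by node order are assumptions introduced here.

```python
def ball(u, k, n_star, nodes):
    """The k nodes closest to u under n_star (ties broken by node order)."""
    return set(sorted(nodes, key=lambda v: (n_star(u, v), v))[:k])

def two_ball_estimate(s, t, edges, n_star, nodes, d):
    """Estimate the distance of (s, t) as (k^2 / M)^(1/d)."""
    x = n_star(s, t)
    # Ball size k = x^{d(d+2)/(2d+2)}, rounded to an integer (an assumption).
    k = max(1, round(x ** (d * (d + 2) / (2 * d + 2))))
    ball_s = ball(s, k, n_star, nodes)
    ball_t = ball(t, k, n_star, nodes)
    # M counts the edges with one endpoint in each ball.
    m = sum(1 for (u, v) in edges
            if (u in ball_s and v in ball_t) or (u in ball_t and v in ball_s))
    return (k * k / m) ** (1 / d) if m > 0 else float("inf")
```

Here `edges` is the full observed edge list, so in the multi-category setting the count also picks up edges from other categories; bounding that contamination is the subject of the rest of the proof.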

If only edges from $E_{sg}$ were counted, Theorem 5.1 would apply verbatim. However, edges
between $B^*_s$ and $B^*_t$ from other categories can be erroneously included in the count. The presence
of other categories never decreases $M_{s,t}$, so the high-probability lower bound on $M_{s,t}$, and hence
the high-probability upper bound on $\mathcal{N}'(s,t)$, carries over from Theorem 5.1.

We need to prove a lower bound on J\f'{s, t). Let m]*] be the number of category-i edges between 
B* and B* . In the proof of Theorem 15.11 we showed that with high probability, Mg^} < ^^_g^^ y , 

for some constant c. This implies Afj*] < ^(1 + 0{cr.j;/x))'^ < ^(1 + 0{cdr,j:/x)). 

We next count edges from other categories between $B^*_s$ and $B^*_t$. Fix some category $j \ne i$, and
consider node pairs $(u \in B^*_s, u' \in B^*_t)$. We distinguish between two distance scales for $\mathcal{N}_j(u,u')$.

1. We first consider the case that $\mathcal{N}_j(u,u') > R^{1+1/(d+1)}$. The probability for the edge $(u,u')$ to
exist is then at most $O(R^{-(d+1)+1/(d+1)})$. The number of candidate pairs $(u,u')$ is at most
$k^2 = x^{d+1-1/(d+1)} \le R^{d+1-1/(d+1)}$, so the expected number of such long edges is $O(1)$. Using
Chernoff Bounds, with high probability, the number of long edges is at most 0(log^ n). 

2. The other case is $\mathcal{N}_j(u,u') \le R^{1+1/(d+1)}$. We divide the range of possible distances into
exponentially increasing buckets of the form {y,2y]. Suppose that y < Mj{u,u') < 2y (for 
some y < R/2). Then, the pair {u,u') has an edge with probability at most 0{y~'^), and by 
the Scale- i?^''"^/^'^'^^^ Category-Disjointness condition, there are at most 0{y'^/n) • |^*| + 
O(log^n) pairs {u,u') at this distance scale. Using linearity of expectations, and summing 
over all O(logn) distance scales y, we obtain that the expected number of short category-j 
edges between B* and B^ is at most o^l^J-L^iiiHili _|_ log^ n), and Chernoff Bounds establish 
concentration. 
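As a quick check of the exponent arithmetic in the first case (a worked calculation added for illustration, using only $k = x^{d(d+2)/(2d+2)}$):

```latex
k^2 = x^{\frac{2d(d+2)}{2d+2}} = x^{\frac{d(d+2)}{d+1}},
\qquad
\frac{d(d+2)}{d+1} = \frac{(d+1)^2 - 1}{d+1} = (d+1) - \frac{1}{d+1}.
```

Hence $k^2 = x^{d+1-1/(d+1)} \le R^{d+1-1/(d+1)}$ whenever $x \le R$, so a per-pair edge probability of order $R^{-(d+1)+1/(d+1)}$ leaves $O(1)$ long edges in expectation.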

Combining both cases, and substituting that \Bg \ = \B^ \ = k gives us that with high probability, 
the number of category-j edges between B* and B^ is at most 0(^^^ • + log^n). Combining 

these edges across all categories j ^ i and plugging in the upper bound for M^l, we obtain: 

Mg, < ^ (l + O (c'i 5) ) + OiK) + log^ n) . 

Adding some $\log n$ factors for simplification, and hiding the constants inside $O(\cdot)$, we can rewrite
this bound as follows:

Ms,t < ^ (l + 0(log^ n) (x-'^/(2'^+2) + ^)) . 
Substituting the definition M'{s,t) = {k^ / Mg^tY^'^ , it follows that 

MI{S, t)>x(l- 0(log2 n) (x-'^/(2d+2) + > ^ _ o(iog2 (^^(d+2)/i2d+2) + ^) . □ 

6.1.2 Analysis: the post-processing step 

Theorem 6.2 easily follows from Lemma 6.3 and the following Lemma 6.4, which analyzes the
post-processing step. The lemma is not specific to the actual estimates produced by the Two-Ball
Algorithm. Instead, it states that if each individual edge's length is estimated with small additive
distortion (compared to the true edge length), then the multiplicative distortion of the overall
estimates is small. For readability, we continue to omit the subscript $i$ from all metrics.

Lemma 6.4. Assume the setting of Theorem 4.1, and let $\delta$ be the expansion in $\mathcal{N}^*$. Consider
running the post-processing step of the Extended Two-Ball Algorithm (parameterized by some $R$)
on distance estimates $\mathcal{N}_H$ satisfying the following for some $\Delta \le R/(4\delta^2)$:

$|\mathcal{N}_H(u,v) - \mathcal{N}(u,v)| \le \Delta$ for all $(u,v) \in H$. (11)

Then, the final estimates $\mathcal{N}'(s,t)$ have multiplicative distortion $1 + O(\delta^2 \Delta / R)$ for all node pairs
$(s,t)$ not in $H$.

Proof of Theorem \6.2l Without loss of generality, assume that R < R, where R is from the theorem 
statement. (If R > R, then we could parameterize the algorithm by R instead.) Then the upper 
bound in Lemma [631 becomes A^ = a;('^+2)/(2d+2) . Qilog^n). 

To complete the proof of Theorem 16.21 notice that all edges (u, v) E H, by definition, satisfy 
M*{u,v) < R. As Af* is non-contracting, this also implies that A/'(ii, v) < R, so the bound (jlip 
holds with A = A/j, according to Lemma l6.3[ If {s,t) G H (which happens when J\f*{s,t) < R), 
then we can apply Lemma 16.31 directly to the edge (s,t), obtaining the bound in terms of x. □ 

Proof of Lemma 6.4. Fix a node pair $(s,t) \notin H$, and let $x = \mathcal{N}(s,t)$. Because $(s,t) \notin H$, and the
estimate $\mathcal{N}^*$ has expansion at most $\delta$, we get that $\mathcal{N}(s,t) \ge \frac{1}{\delta}\mathcal{N}^*(s,t) > \frac{R}{\delta}$. Let $H_t \subseteq H$ be the
edge set defined in (10), and for any path $P$, let $\mathcal{N}(P)$ be the length of the path $P$ according to the
distance function $\mathcal{N}$.

We claim that the edge set $H_t$ contains an $s$-$t$ path $P$ with $k = \lceil x/(R/\delta - 1) \rceil$ hops and length
$\mathcal{N}(P) \le \mathcal{N}(s,t) + k$. Consider the straight line between $s$ and $t$ in $\mathbb{R}^d$. For each $i$, let $p_i$ be the
point at $\mathcal{N}$-distance $i \cdot (R/\delta - 1)$ from $s$ on the straight line between $s$ and $t$. The point $p_i$ itself may
not be the location of any node in the social network. However, by near-uniform density (which
guarantees that every unit cube contains at least one node of the network), each point $p_i$ has a
node $u_i$ at distance at most $\mathcal{D}(p_i,u_i) \le d$. Thus, $\mathcal{N}(p_i,u_i) \le d/(C_{sg} k_{sg})^{1/d} \le \frac{1}{2}$ for large enough
$n$, as $C_{sg} k_{sg} = \Omega(\log n)$.

Let $P$ be the path $(s = u_0, u_1, u_2, \ldots, u_{k-1}, t = u_k)$. By the triangle inequality, all edges $(u_j, u_{j+1}) \in P$
have $\mathcal{N}$-length within $\pm 1$ of the distance between the corresponding points $p_j, p_{j+1}$. Therefore,
$\mathcal{N}(P) \le \mathcal{N}(s,t) + k$. Moreover, because each edge $(u,v) \in P$ satisfies $\mathcal{N}(u,v) \le R/\delta$, the fact
that $\mathcal{N}^*$ has expansion at most $\delta$ implies that $\mathcal{N}^*(u,v) \le R$. In particular, each edge of $P$ is
in $H$. Furthermore, all edges $(u_j,u_{j+1}) \in P$ except possibly the last one satisfy $\mathcal{N}^*(u_j,u_{j+1}) \ge
\mathcal{N}(u_j,u_{j+1}) \ge R/\delta - 2$. By definition of $H_t$, it follows that the path $P$ is in $H_t$, completing the proof
of the claim.

Next, we upper-bound the estimated distance $\mathcal{N}'(s,t)$. Simply using the path $P$ we just exhibited,
we see that

$\mathcal{N}'(s,t) \le \mathcal{N}_H(P) \le \mathcal{N}(P) + k\Delta \le \mathcal{N}(s,t) + k(\Delta + 1)$,

where the last inequality used the property that $\mathcal{N}(P) \le \mathcal{N}(s,t) + k$. An upper bound of $1 + O(\delta\Delta/R)$
on the expansion now follows by substituting $k = O(x\delta/R)$.

It remains to bound the contraction, by proving that each $s$-$t$ path $P$ in $H_t$ has $\mathcal{N}_H(P) \ge
\mathcal{N}(s,t) - O(x\delta^2\Delta/R)$. By the same argument as in the preceding paragraph, this holds whenever
$P$ has at most $4x\delta^2/R$ hops. We therefore focus on the case when $P$ has at least $4x\delta^2/R$ hops.
Each of these hops $(u,v)$, except possibly the last one, has $\mathcal{N}^*(u,v) \ge \frac{R}{2\delta}$ by definition of $H_t$. In
turn, by the maximum expansion of $\mathcal{N}^*$, the actual length of each hop is at least $\mathcal{N}(u,v) \ge \frac{R}{2\delta^2}$, so
that the estimates $\mathcal{N}_H$ satisfy $\mathcal{N}_H(u,v) \ge \mathcal{N}(u,v) - \Delta \ge \frac{R}{4\delta^2}$, because we assumed that $\Delta \le \frac{R}{4\delta^2}$.
Summing over all (at least) $4x\delta^2/R$ hops $(u,v)$, we obtain that $\mathcal{N}_H(P) \ge x = \mathcal{N}(s,t)$, so in this
case, the estimate has no contraction at all. This completes the proof of the lower bound. □



6.2 The Recursive Two-Ball Algorithm for multiple categories 

We show that the Recursive Two-Ball Algorithm from Section 5.1 can be applied verbatim in the
case of multiple categories with Scale-$\infty$ Category-Disjointness, yielding poly-logarithmic additive
error. The analysis only needs to be modified slightly to deal with edges from other categories.
However, our guarantees only apply to node pairs at distances $x \le n^{1/(d+1)} = D^{d/(d+1)}$, where
$D = n^{1/d}$ is the diameter of the metric space.

Theorem 6.5. Consider a multi-category social graph with $C_{sg} k_{sg} = \Omega(\log n)$, with Scale-$\infty$
Category-Disjointness and perfectly uniform density for each category. Assume that the social
distance in each category is defined by the $\ell_2$ norm, with $d > 2$. Then, the Recursive Two-Ball
Algorithm runs in polynomial time, and produces distance estimates $\mathcal{N}'_i$ satisfying the following
guarantee with high probability:

For every pair $(s,t)$ of nodes at normalized distance $\mathcal{N}_i(s,t) \le n^{1/(d+1)}$, we have that

$|\mathcal{N}'_i(s,t) - \mathcal{N}_i(s,t)| \le \mathrm{polylog}(n)$.

For normalized distances larger than $n^{1/(d+1)}$, even under actual randomly permuted categories,
the number of edges from other categories grows prohibitively large; it seems
unlikely that this obstacle could be easily overcome.

However, we can use the improved estimates from Theorem 6.5 with the post-processing step
from the Extended Two-Ball Algorithm (with $R = n^{1/(d+1)}$). The resulting algorithm estimates
normalized distances $x > R$ with additive error $(x/R) \cdot \mathrm{polylog}(n)$. (This follows from the
shortest-path argument encapsulated in Lemma 6.4.)

Proof of Theorem 6.5. The proof of Theorem 5.2 applies almost verbatim. Recall that the Recursive
Two-Ball Algorithm counts edges between balls $B'_s$, $B'_t$ around $s$ and $t$, containing $k = \Theta(r_x^d)$
nodes each, where $r_x = x^{(d+1)/(2d)}$. These balls are calculated with respect to the distances estimated
by the algorithm in earlier stages. The only added difficulty for the analysis in the case of multiple
categories is bounding the additional edges between $B'_s$ and $B'_t$ arising from categories $j \ne i$.

Notice that there are = 0(x"'"'"^) < 0{x ■ n) pairs of nodes that could have an edge between 
them. Focus on one category j / i, and divide node pairs {u,v),u S B'g,v £ B[ into buckets of 
the form (y, 2y] depending on their distance in category j. By Scale-oo Category-Disjointness, the 
bucket {y,2y\ contains at most 0(^ • \B'g\ \B[\ -\- log^ ?i) = 0{y'^ ■ x + log^ ?i) node pairs. Each 
of these node pairs gives rise to an edge with probability at most 0{y~'^), and summing over all 
O(logn) buckets (y, 2y] gives us that the expected number of category-j edges between B'^ and B'^ 
is at most 0(xlogn -|- log^ n) = 0{xlog'^ n). Using Chernoff Bounds and a union bound over all 
categories, with high probability, the total number of edges added by categories j ^ i \s at most 
0{Kx log^ n).
Because log^ n = 0{a{rx)) for sufficiently large poly-logarithmic x, the 0{Kx log^ n) = 0{x a(r^))
additional edges are easily subsumed in the error bound of 0{x a{'rx)) already present in the proof
of Lemma 5.5. For smaller distances x, the only change will be a slightly different poly-logarithmic
base case for a{rx). □



7 Constant target degree 

The analysis so far has relied heavily on the fact that the target degree $k_{sg}$ (essentially the expected
average node degree) was at least logarithmic. Indeed, as discussed in Section 3, the first obvious
problem with constant expected degree is that with non-negligible probability, the social graph $E_{sg}$
is disconnected. To circumvent this problem, much of the past literature (e.g., \12\ HOj W\\ 161] ) 
assumes that in addition to the random edges, the network also contains a set $E_{loc}$ of local edges
deterministically. In the literature, $E_{loc}$ is frequently the $d$-dimensional grid. We adopt a more
general model in which $E_{loc}$ can be essentially any set of short edges. A constant target degree
poses two additional challenges beyond mere connectivity:

• There are insufficiently many long-range links to support pruning via counting common
neighbors. Even for short distances, the number of common neighbors is only constant, and
high-probability guarantees can therefore not be obtained. Therefore, in order to identify short
edges as such, we need to rely on the structure of $E_{loc}$.

• To avoid stochastic dependence between multiple stages (such as the Two-Hop Test and
Two-Ball Algorithm), we had previously partitioned $E_{sg}$ randomly into separate sets to be used in
the stages. With constant node degrees, this may risk leaving the Two-Hop Test with only
half of the local edges $E_{loc}$. Hence, partitioning the edges may not be viable any more. On
the other hand, if the same edges are used in multiple stages, subtle stochastic dependencies
between the stages are created; our analysis needs to carefully account for these dependencies.

In this section, we explore the changes (in modeling, algorithms and analysis) necessary to deal 
with constant target degrees. We focus on the single-category case for the remainder of the section. 
Our results apply so long as the set of local edges is "rich enough" in local connectivity. 

Definition 7.1 ("Richness" of local edges). 1. An edge set $E$ is a $(\sigma,\delta)$-spanner if its
shortest-path distance $\mathcal{D}^{SP}_E$ satisfies the following for all node pairs $(u,v)$:

$\sigma \cdot \mathcal{D}(u,v) \le \mathcal{D}^{SP}_E(u,v) \le \delta \cdot \mathcal{D}(u,v)$.

2. A set $E$ of edges is $(b,h)$-connected if for every edge $(u,v) \in E$, $E$ contains $b$ edge-disjoint
$u$-$v$ paths of at most $h$ edges each.

3. $E_{loc}$ is $(b,h)$-rich with distortion $(\sigma,\delta)$ if it is a $(\sigma,\delta)$-spanner and contains a $(b,h)$-connected
$(\sigma,\delta)$-spanner $E'_{loc} \subseteq E_{loc}$ (called its connectivity witness).

Without loss of generality, $E_{loc}$ can also include all edges which would be included by the basic small-world model
with probability 1.

See, e.g., the difficulties faced by [29]. The authors of [29] consider a small-world model with one random neighbor
for each node. They can only make guarantees about pruning away all but a poly-logarithmic number of long-range
edges. The main reason is that even distant nodes will choose the same random neighbor with probability $\Omega(1/n)$, and
high-probability bounds therefore only guarantee at most poly-logarithmically many long random edges to remain.

Remark. As an example, the $d$-dimensional toroidal grid is $(2d-1, 3)$-rich and (for $d \ge 2$) $(2d, 7)$-rich,
both with distortion $(1, O(1))$.

Next we present a solution which relies on knowing parameters $(b,h)$ of the local structure's
richness. In other words, the pruning algorithm needs to know how rich a local structure to expect.
In Section 7.2 we show how to make the pruning algorithm adapt to the available richness under
fairly mild assumptions.

7.1 Basic Approach: Edge-Disjoint Paths 

Our solution is based on a more careful design of the pruning stage, where instead of counting
common neighbors, the algorithm counts edge-disjoint paths of bounded length. The pruning stage
is very simple: The algorithm starts with an edge set $E = E_{sg}$. It prunes each edge $(u,v) \in E$ such
that $E$ does not contain $b$ edge-disjoint $u$-$v$ paths of at most $h$ hops each. This is repeated until no
more edges can be pruned. We call this algorithm the $(b,h)$-EDP Pruning Algorithm; here, EDP
stands for Edge-Disjoint Paths. See Algorithm 5 for pseudocode.

Algorithm 5 The $(b,h)$-EDP Pruning Algorithm.

Input. Edge set $E$.

Repeat

1. Find any $(u,v) \in E$ s.t. $E$ does not contain $b$ edge-disjoint $u$-$v$ paths of at most $h$ hops each.

2. Prune $(u,v)$ from $E$.

Until no such edges $(u,v)$ remain.
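The pruning loop can be written out directly for small constant $b$ and $h$. The following Python sketch is a brute-force illustration under assumptions made here (undirected simple graph given as a list of node pairs, hypothetical function names `short_paths`, `has_disjoint_paths`, and `edp_prune`); it enumerates all simple paths of at most $h$ hops and searches for $b$ pairwise edge-disjoint ones by backtracking. As in the grid example from the Remark, the edge $(u,v)$ itself counts as a one-hop path.

```python
def short_paths(adj, u, v, h):
    """All simple u-v paths with at most h hops, each as a frozenset of edges."""
    found = []

    def dfs(node, visited, used):
        if node == v:
            found.append(frozenset(used))
            return
        if len(used) == h:
            return
        for w in adj[node]:
            if w not in visited:
                dfs(w, visited | {w}, used + [frozenset((node, w))])

    dfs(u, {u}, [])
    return found

def has_disjoint_paths(paths, b):
    """Backtracking search for b pairwise edge-disjoint paths among `paths`."""
    def pick(start, taken, count):
        if count == b:
            return True
        for i in range(start, len(paths)):
            if paths[i].isdisjoint(taken) and pick(i + 1, taken | paths[i], count + 1):
                return True
        return False

    return pick(0, frozenset(), 0)

def edp_prune(edges, b, h):
    """Repeatedly prune edges lacking b edge-disjoint paths of at most h hops."""
    remaining = {frozenset(e) for e in edges}
    changed = True
    while changed:
        changed = False
        adj = {}
        for e in remaining:
            u, v = tuple(e)
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        for e in list(remaining):
            u, v = tuple(e)
            if not has_disjoint_paths(short_paths(adj, u, v, h), b):
                remaining.discard(e)
                changed = True
                break  # rebuild adjacency before checking further edges
    return remaining
```

For example, on a triangle with one pendant edge, `edp_prune(edges, 2, 2)` keeps the triangle and drops the pendant edge. The brute-force search is exponential in $h$ and is meant only to make the definition concrete.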



The idea is that this algorithm keeps a sufficiently rich subset of local edges, and prunes all
edges in $E_{sg}$ whose length exceeds some threshold $r_{EDP}$ (defined in Equation (12)). (We call such
edges long edges.) For edges of intermediate length, the algorithm makes no guarantees about
whether they are pruned. Crucially, the pruned graph does not depend on the long edges, in the
following sense: Let $E_{sg}$, $E'_{sg}$ be two edge sets generated according to the same distribution, such
that the random choices for non-long edges are the same, and the random choices for long edges
are independent. Then, with high probability (over the random process generating all edges of $E_{sg}$
and $E'_{sg}$), the remaining set of edges after pruning is the same for both $E_{sg}$ and $E'_{sg}$. The advantage
of this guarantee is that we do not need to worry about dependencies on the pruned graph, so
long as the post-processing stage only uses long edges. Therefore, we can use the pruned graph
to define the initial estimates $\mathcal{N}^*$ for normalized distances and then use a suitably modified and
optimized version of the (Recursive) Two-Ball Algorithm which only considers node pairs $(s,t)$ for
which $\mathcal{N}^*(s,t)$ is sufficiently large. We omit the (easy) modifications of the algorithm and analysis.

We start the analysis of the $(b,h)$-EDP Pruning Algorithm with several observations. First,
notice that the pruned graph $T(E)$ is the maximal $(b,h)$-connected subset of $E$, i.e., the union of
all such subsets. It follows that $T(E)$ does not depend on the order in which the edges are pruned.
Second, because $T(E)$ is the maximal $(b,h)$-connected subset of $E$, the pruned graph $T(E)$ does
not depend on the presence or absence of the pruned edges $e \in E \setminus T(E)$. Formally, $T(E) = T(E')$
whenever $T(E) \subseteq E' \subseteq E$.

Fix an edge $(u,v)$. As a base case, for $d = 2$, it is easy to construct three paths of lengths (1, 3, 3), or four paths
of lengths (1, 3, 5, 7). For each added dimension, there are two additional disjoint paths of length 3, taking one edge
along the new dimension, an edge parallel to $(u,v)$, and another edge in the new dimension. These paths are clearly
disjoint.

To ensure correctness, we can use the $(b,h)$-EDP Pruning Algorithm only if the local structure
is $(b,h)$-rich. The performance depends on the parameters $(b,h)$: we get better estimates for
larger $b$ and smaller $h$. We summarize our results as follows. In a slight abuse of notation, here,
the (Recursive) Two-Ball Algorithm refers to the suitably modified version that works with the
$(b,h)$-EDP Pruning Algorithm.

Theorem 7.2. Consider a single-category social graph of near-uniform density. Suppose that the
local edge set $E_{loc}$ is $(b,h)$-rich with distortion $(\sigma,\delta)$. Let $D = \Theta(n^{1/d})$ be the diameter of the metric
space. For any constant $\alpha > 0$ (which need not be known to the algorithm), let

rEDp(a) = (o(^^^ + iogi+a^))2Vrf = 0'^^+°'^^ ■ {0{logn)f^''\ (12) 

Let $E'$ be the edge set retained by the $(b,h)$-EDP Pruning Algorithm. Then, with probability at least
$1 - O(n^{-\alpha})$, the following hold.

(a) $E'$ contains the connectivity witness $E'_{loc}$ of $E_{loc}$ and no edges whose length exceeds $r_{EDP}(\alpha)$.
The algorithm makes no guarantees for other edges.

(b) Let "D^P be the shortest-path distance on E' . Then, for all node pairs {u,v), we have that 

V{u,v) < (3V^^{u,v) < 6-(3V{u,v), w/iere /3 = max(i, tedp (a)). 

In words, the shortest-path distance in $E'$, scaled up by $\beta$, gives no contraction, and expansion
at most $\delta\beta$.

(c) The Two-Ball Algorithm reconstructs all normalized distances J\f{u,v) with unit distortion 
and additive error rEop{a){J\f^ {u, v) + rEDp(a)), where 7 = 

(d) Assume that the metric has perfectly uniform density, and the social distance is the $\ell_2$ norm
for $d > 3$ dimensions. Then the Recursive Two-Ball Algorithm reconstructs all normalized
distances with unit distortion and additive error $r_{EDP}(\alpha) \cdot \mathrm{polylog}(n)$.

Proof. Most of the proof will focus on the first part of the theorem, i.e., that with high probability,
all edges of length at least $r_{EDP}(\alpha)$ are pruned. The remaining parts then follow analogously to
previous proofs. The proof of the second part is virtually identical to the proof of Lemma 4.3. The
analysis of the (Recursive) Two-Ball Algorithm is also similar to the high-degree case, as long as
we establish the independence between the pruned graph and the long edges, i.e., the edges of length
exceeding $r_{EDP}(\alpha)$. The reason that this independence is sufficient is that the (Recursive) Two-Ball
Algorithm only uses long edges, and its analysis can then omit any conditioning on the pruned
graph.

To prove independence formally, let $E_{sg}$ be a random edge set, and $E$ the set of all its non-long
edges (of length at most $r_{EDP}(\alpha)$). Let $E'_{sg}$ be another random edge set drawn from the
same distribution whose non-long edges are also exactly $E$, while its long edges are generated
independently from those of $E_{sg}$. With high probability, the $(b,h)$-EDP Pruning Algorithm will
prune all long edges from both $E_{sg}$ and $E'_{sg}$. By the observation preceding Theorem 7.2, this implies
that $T(E_{sg}) = T(E)$ and $T(E'_{sg}) = T(E)$, so that the $(b,h)$-EDP Pruning Algorithm will produce
the same pruned edge set on both graphs.

The remainder of the proof focuses on the first part of the theorem, i.e., the fact that with high
probability, all long edges are pruned. The proof involves an intricate Deferred Decisions argument
encapsulated in Lemma 7.3 below, which may be of interest in its own right.

Fix parameters $(b,h)$ and a node pair $(s,t)$, and let $r = \mathcal{D}(s,t) > r_{EDP}(\alpha)$. In applying Lemma
7.3, we consider the "universal set" $U$ of all node pairs. Recall that the edge set $E = E_{sg}$ includes
each node pair $(u,v)$ independently with some probability $p_{(u,v)}$. The "feasible subsets" of $U$
("feasible paths") are all simple $s$-$t$ paths of at most $h$ hops. Any such path must contain at
least one hop of length at least $r/h$; the corresponding edge is present with probability at most
$q = C_{sg} k_{sg} (h/r)^d$. By Lemma 7.3, we obtain that for each $c \in \mathbb{N}$,

$\mathrm{Prob}[E_{sg} \text{ contains } b \text{ disjoint feasible paths}] \le \mathrm{Prob}[\,|E'| > c\,] + (cq)^b$, (13)

where $E'$ is the set of all node pairs $(u,v)$ such that $E_{sg} \cup \{(u,v)\}$ contains a feasible path.

The edge $(s,t)$ is retained with probability at most $\pi_{s,t}$. Once we prove that $\pi_{s,t} = O(n^{-(2+\alpha)})$,
we can complete the proof by taking the Union Bound over all node pairs $(s,t)$. So it remains
to upper-bound the right-hand side of (13) by $O(n^{-(2+\alpha)})$.

We first bound $\mathrm{Prob}[\,|E'| > c\,]$ in (13). Let the random variable $\Delta$ denote the maximum degree
of $E_{sg}$. Any node pair $(u,v) \in E'$ has the property that $E_{sg}$ contains both an $s$-$u$ path and a $v$-$t$
path of length at most $h$ hops each. Therefore, for fixed endpoints $(s,t)$, there are at most $\Delta^h$
candidates for $u$ and at most $\Delta^h$ candidates for $v$, and thus at most $\Delta^{2h}$ candidates for $(u,v)$. We
have thus proved that $|E'| \le \Delta^{2h}$. Now, using Chernoff Bounds to upper-bound $\Delta$, we have:

$\mathrm{Prob}[\Delta > e(k_{sg} + \log\frac{1}{\delta})] \le \delta/n^2$, for all $\delta > 0$.

Therefore $\mathrm{Prob}[\,|E'| > c\,] \le \delta/n^2$ for $c = (e(k_{sg} + \log\frac{1}{\delta}))^{2h}$.

Substituting this choice of $c$ into (13) and taking $\delta = n^{-\alpha}$, we obtain:

$\pi_{s,t} = O(n^{-(2+\alpha)} + (cq)^b)$.

Finally, we show that $\pi_{s,t} = O(n^{-(2+\alpha)})$ by substituting $q = C_{sg} k_{sg} (h/r)^d$ and $r > r_{EDP}(\alpha)$. □

Lemma 7.3. Consider a universe set $U$ and a collection $\mathcal{F}$ of non-empty subsets of $U$ called
feasible sets. A random set $E \subseteq U$ is obtained by including each element $e \in U$ independently with
probability $p_e$. The goal is to bound from above the number of disjoint feasible subsets of $E$.

Fix $q \in [0,1]$ such that each feasible set contains at least one element $e$ with $p_e \le q$. Let $E'$ be
the set of elements $e \in U$ such that $F \subseteq E \cup \{e\}$ for some feasible set $F$. Then, for each $b \in \mathbb{N}$,



$\mathrm{Prob}[E \text{ contains } b \text{ disjoint feasible sets}] \le \min_{c \in \mathbb{N}} \left( \mathrm{Prob}[\,|E'| > c\,] + \frac{1}{1-cq}\,(cq)^b \right)$. (14)



Proof. An element $e \in U$ with $p_e \le q$ is called a witness. Fix an arbitrary ordering $\rho$ of $U$ in
which all non-witnesses precede all witnesses. For each feasible set $F \in \mathcal{F}$, the latest witness $w$ in
$F$ according to $\rho$ is called a canonical witness for $F$. If furthermore $F \subseteq E$, then $w$ is called
$E$-important. Since each feasible set $F \subseteq E$ contains an $E$-important witness, from here on, we
will focus on counting distinct $E$-important witnesses (rather than disjoint feasible sets $F \subseteq E$).

We reveal one by one whether elements of $U$ are included in $E$, in the order of $\rho$. For each witness
$w$, let $E_w$ be the actual subset of $E$ that is revealed before $w$ is considered. Let us say that $w$ is
$\rho$-important if it is a canonical witness for some feasible set $F \subseteq E_w \cup \{w\}$. Then, $w$ is $E$-important
if and only if $w \in E$ and $w$ is $\rho$-important. The latter two events, namely $\{w \text{ is } \rho\text{-important}\}$ and
$\{w \in E\}$, are independent.

Let $w(t)$ be the $t$-th $\rho$-important witness chosen in the above revelation process, $X_t = \mathbf{1}_{\{w(t) \in E\}}$, and
let $N$ be the total number of $\rho$-important witnesses. Then, $S_N = \sum_{t=1}^{N} X_t$ is the total number of
$E$-important witnesses. Our goal is to bound $S_N$ from above.

We accomplish this goal via Lemma 7.4 below. The sequence $\{X_t\}$ and the stopping time $N$
satisfy the conditions in Lemma 7.4 (the upper bound). Specifically, we have established that
$E[X_t \mid N \ge t] = p_{w(t)} \le q$, and the event $\{X_t = 1\}$ is independent of the past history given that
$N \ge t$. By Lemma 7.4, we obtain that for all $c$,

$\mathrm{Prob}[S_N \ge b] \le \mathrm{Prob}[\mathrm{Bin}_{c,q} \ge b] + \mathrm{Prob}[N > c]$, (15)

where $\mathrm{Bin}_{c,q}$ is a random variable distributed according to the Binomial distribution with $c$ samples
and success probability $q$. We have $\mathrm{Prob}[N > c] \le \mathrm{Prob}[\,|E'| > c\,]$, since each $\rho$-important witness
is in $E'$. We complete the proof by noting that

$\mathrm{Prob}[\mathrm{Bin}_{c,q} \ge b] = \sum_{l=b}^{c} \binom{c}{l} q^l (1-q)^{c-l} \le \sum_{l=b}^{\infty} (cq)^l = \frac{1}{1-cq}\,(cq)^b$. □
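The final step of the proof bounds a binomial tail by a geometric series. The following Python sketch checks that inequality exactly with rational arithmetic, assuming the bound takes the form $\binom{c}{l} q^l (1-q)^{c-l} \le (cq)^l$ summed over $l \ge b$, which requires $cq < 1$. The function names are hypothetical and introduced only for this check.

```python
from fractions import Fraction
from math import comb

def binomial_tail(c, q, b):
    """Exact Prob[Bin(c, q) >= b] for a rational success probability q."""
    return sum(comb(c, l) * q**l * (1 - q)**(c - l) for l in range(b, c + 1))

def geometric_bound(c, q, b):
    """(cq)^b / (1 - cq), i.e. the sum of (cq)^l over all l >= b (needs cq < 1)."""
    if c * q >= 1:
        raise ValueError("the geometric series only converges for cq < 1")
    return (c * q)**b / (1 - c * q)

# Example: c = 10 trials, q = 1/50, at least b = 3 successes.
q = Fraction(1, 50)
print(float(binomial_tail(10, q, 3)), float(geometric_bound(10, q, 3)))
```

The bound is loose for small $b$ but decays geometrically in $b$, which is what the union bound over node pairs needs.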

Lemma 7.4 below is a technical lemma for analyzing a certain kind of "revelation process,"
in which a sequence of history-dependent 0-1 random variables is revealed, and the length of
this sequence is also a history-dependent random variable. The lemma shows that whenever the
expectation of each individual 0-1 random variable can be bounded, we can also bound the sum:
we relate its distribution to the corresponding Binomial distribution. We will also use this lemma
in the analysis of the adaptive algorithm in Section 7.2.

Lemma 7.4. Consider a stochastic process $X_t \in \{0,1\}$, $t \in \mathbb{N}$ and a stopping time $N$ on a filtration
$\{\mathcal{F}_t : t \in \mathbb{N}\}$. Define $S_N = \sum_{t=1}^{N} X_t$. Assume that for some constants $p \le q$ we have

$E[X_t \mid N \ge t, F] \in [p, q]$ for all $t \in \mathbb{N}$, $F \in \mathcal{F}_{t-1}$.

Our goal is to bound the distribution of $S_N$ in terms of the Binomial distribution.

Let $\mathrm{Bin}_{t,p}$ be a random variable distributed according to the Binomial distribution with $t$ samples
and success probability $p$. Then, for all $x, t \in \mathbb{N}$, we have that

$\mathrm{Prob}[\mathrm{Bin}_{t,p} \ge x] - \mathrm{Prob}[N < t] \le \mathrm{Prob}[S_N \ge x] \le \mathrm{Prob}[\mathrm{Bin}_{t,q} \ge x] + \mathrm{Prob}[N > t]$. (16)

Proof. It suffices to prove the lower bound in (16): the upper bound is then derived from the lower
bound applied to the stochastic process $\{1 - X_t \mid t \in \mathbb{N}\}$. Let $\{Y_t \mid t \in \mathbb{N}\}$ be a family of mutually
independent 0-1 random variables with expectation $p$, and define

$X^*_t = X_t$ if $N \ge t$, and $X^*_t = Y_t$ otherwise.

For each $t$, let $S_t = \sum_{s=1}^{t} X_s$, $S^*_t = \sum_{s=1}^{t} X^*_s$, and $\mathcal{F}^*_t = \sigma(X^*_1, \ldots, X^*_t)$. For each event $F \in \mathcal{F}^*_{t-1}$,
we have that

$E[X^*_t \mid F, N \ge t] = E[X_t \mid F, N \ge t] \ge p$,
$E[X^*_t \mid F, N < t] = E[Y_t \mid F] = p$,

which implies that

$E[X^*_t \mid F] = E[X^*_t \mid F, N \ge t] \cdot \mathrm{Prob}[N \ge t \mid F] + E[X^*_t \mid F, N < t] \cdot \mathrm{Prob}[N < t \mid F] \ge p$.

By induction on $t$, it follows that $\mathrm{Prob}[S^*_t \ge x] \ge \mathrm{Prob}[\mathrm{Bin}_{t,p} \ge x]$ for all $x, t \in \mathbb{N}$. Noting that
$S_N \ge S_t = S^*_t$ whenever $N \ge t$, we obtain that

$\mathrm{Prob}[S_N \ge x] \ge \mathrm{Prob}[S_N \ge x \mid N \ge t] \cdot \mathrm{Prob}[N \ge t]$
$\ge \mathrm{Prob}[S^*_t \ge x \mid N \ge t] \cdot \mathrm{Prob}[N \ge t]$
$= \mathrm{Prob}[S^*_t \ge x \text{ and } N \ge t]$
$\ge \mathrm{Prob}[S^*_t \ge x] - \mathrm{Prob}[N < t]$
$\ge \mathrm{Prob}[\mathrm{Bin}_{t,p} \ge x] - \mathrm{Prob}[N < t]$. □

7.1.1 Running times in Theorem 7.2

While the main thrust in this paper is information-theoretic, the algorithms in Theorem 7.2 are
actually polynomial. Let us discuss how to improve the running times to near-linear, an important
feature for the sizes of networks we are envisioning.

The naive implementation of the $(b,h)$-EDP Pruning Algorithm checks every remaining edge
at each iteration, which gives a running time of O(n^). We show how to reduce it to $\tilde{O}(n)$.

Lemma 7.5. The $(b,h)$-EDP Pruning Algorithm can be implemented in $\tilde{O}(n)$ time for constant $b$
and $h$.

Proof. We maintain a queue of edges to be checked, initially containing all edges of $E_{sg}$. In each
step, one edge $e = (u,v)$ is removed from the queue and checked for pruning with respect to the
current pruned graph $E_{cur}$. If $E_{cur}$ does not contain the requisite $b$-tuple of edge-disjoint paths
of length at most $h$, then $e$ is pruned permanently. Otherwise, the $b$-tuple of paths provides a
"certificate" for $e$. Later iterations may remove edges from this certificate; therefore, for each edge
$e'$ in the certificate, the algorithm stores a pointer that $e'$ is part of the certificate for $e$. If $e'$ is
pruned at any point, then, following the pointers, the algorithm can determine all edges $e$ whose
certificates $e'$ participates in. Upon pruning $e'$, all such edges $e$ are then re-enqueued and will
need to be checked again for alternative certificates. Once the queue becomes empty, the algorithm
terminates.

Without loss of generality, the target degree ksg is 0(log^ n) (otherwise, the much more efficient 
Two-Hop Test from Section U] would be used). By Chernoff Bounds, all node degrees are 0(log^ n) 
with high probability. Finding a certificate for a given edge using brute force then takes only 
polylog(n) time. Moreover, for each edge e, there can be at most polylog(n) edges whose certificates 
e participates in. No new edges are added to the queue if the current edge is not pruned, and at 
most polylog(n) edges are added otherwise. Therefore, the running time is 0{n). □ 
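The queue-and-certificate bookkeeping from the proof can be sketched as follows. This is an illustrative Python version under assumptions made here, not the authors' implementation: certificates are found by brute-force enumeration of short paths, the names `certificate`, `edp_prune_queue`, and `supports` are hypothetical, and stale pointers are tolerated (they only cause harmless extra re-checks).

```python
from collections import deque

def certificate(adj, alive, u, v, b, h):
    """Edges of b edge-disjoint u-v paths (<= h hops) inside `alive`, or None."""
    paths = []

    def dfs(node, visited, used):
        if node == v:
            paths.append(frozenset(used))
            return
        if len(used) == h:
            return
        for w in adj[node]:
            e = frozenset((node, w))
            if w not in visited and e in alive:
                dfs(w, visited | {w}, used + [e])

    def pick(start, taken, count):
        if count == b:
            return taken
        for i in range(start, len(paths)):
            if paths[i].isdisjoint(taken):
                found = pick(i + 1, taken | paths[i], count + 1)
                if found is not None:
                    return found
        return None

    dfs(u, {u}, [])
    return pick(0, frozenset(), 0)

def edp_prune_queue(edges, b, h):
    alive = {frozenset(e) for e in edges}
    adj = {}
    for e in alive:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    queue = deque(alive)
    supports = {}  # edge f -> edges whose recorded certificate uses f
    while queue:
        e = queue.popleft()
        if e not in alive:
            continue
        u, v = tuple(e)
        cert = certificate(adj, alive, u, v, b, h)
        if cert is not None:
            for f in cert:
                supports.setdefault(f, set()).add(e)
        else:
            alive.discard(e)  # pruned permanently
            for g in supports.pop(e, ()):
                if g in alive:
                    queue.append(g)  # its certificate lost an edge; re-check
    return alive
```

Every edge that survives was last checked against a certificate whose edges all survive as well, so the output is $(b,h)$-connected; since only edges without a certificate in the current graph are removed, it is the maximal such subset.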

We also comment on the running time of the Two-Ball Algorithm. Applying this algorithm to
a given node pair $(u,v)$ can be computationally expensive when $\mathcal{N}(u,v)$ is large (and consequently,
the algorithm needs to consider large balls around $u$ and $v$). Thus, the Two-Ball Algorithm for a
given node pair can be viewed as a precise but costly distance measurement. Instead of applying it
to every node pair, we could instead use the beacon-based triangulation technique from [33]: here,
one selects 0{{^) i^)'^) "beacon nodes" uniformly at random, and measures the distance from each 
node only to each beacon. This technique achieves distortion $(1+\delta)C$ for all but an $\epsilon$-fraction of
node pairs, where $C$ is the distortion of the Two-Ball test.

7.2 Adapting to the "optimal" richness

Theorem 7.2 assumes that the $(b,h)$-richness of the local edge set $E_{loc}$ is known to the algorithm.
In reality, it is desirable to adapt to the "optimal" richness without knowing it in advance. Here,
the "optimal" richness means the $(b,h)$ pair that minimizes $r_{EDP}(\alpha)$ in Equation (12), subject to the
constraint that $E_{loc}$ is $(b,h)$-rich with small distortion. We show that such an automatic adaptation
can be achieved if $E_{loc}$ is "robust," in the sense defined below.

Our algorithm, called Adaptive EDP algorithm, proceeds as follows: for a given set $H$ of candidate
hop counts, we try all $(b,h)$ pairs, $h \in H$, in order of increasing $r_{EDP}(\alpha)$ until the pruned graph
is connected, and focus on the last pair. Without loss of generality, we can start with $b$ equal to
the smallest node degree in $E_{sg}$. We can use binary search over the $(b,h)$ pairs (in the same order)
to reduce the number of pairs that we need to consider.
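The linear-scan version of this procedure can be sketched in a few lines of Python. The sketch makes several assumptions that are not from the paper: the candidate pairs are passed in already sorted by increasing $r_{EDP}(\alpha)$, the pruning routine is supplied as a hypothetical callable `prune(edges, b, h)`, and the helper names `is_connected` and `adaptive_edp` are invented for this example. The binary-search refinement is omitted.

```python
def is_connected(nodes, edges):
    """True if the undirected graph (nodes, edges) is connected."""
    nodes = list(nodes)
    if not nodes:
        return True
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)

def adaptive_edp(nodes, edges, pairs, prune):
    """Try (b, h) pairs in the given order; stop at the first connected result.

    pairs: candidate (b, h) pairs, assumed sorted by increasing r_EDP.
    prune: hypothetical callable (edges, b, h) -> retained edges.
    Returns the chosen pair and its pruned edge set, or (None, set()).
    """
    for b, h in pairs:
        kept = prune(edges, b, h)
        if is_connected(nodes, kept):
            return (b, h), kept
    return None, set()
```

Any implementation of the $(b,h)$-EDP Pruning Algorithm can be plugged in as `prune`; the scan returns the last pair tried, which is the first one whose pruned graph is connected.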

While the above algorithm is very simple, the challenge is to prove that it works. That is, we
need to identify a suitable "robustness property" of $E_{loc}$ and argue that under this property, the
chosen $(b,h)$ pair is optimal. Let $T_{b,h}(E)$ denote the pruned graph if the $(b,h)$-EDP Pruning Algorithm
is applied to the edge set $E$. We rely on the following crucial observation:

Lemma 7.6. Consider a single-category social graph with near-uniform density. Suppose that the
local structure $E_{loc}$ is a $(\sigma,\delta)$-spanner, and moreover, $T_{b,h}(E_{loc})$ contains at least $\epsilon n$ isolated nodes,
for some parameters $b, h, \epsilon, \delta$ such that

{26hy ClnCsgksg<l. (17) 

Then $T_{b,h}(E_{sg})$ is disconnected with high probability.

Remark. Since $C_{sg} = \Theta(1/\log n)$ and $C_{ud} = \Theta(1)$, condition (17) holds, for large enough $n$,
whenever $k_{sg}$, $\delta$ and $h$ are constants.

Lemma 17.61 is proved below. It naturally motivates the following definition of "robustness." 

Definition 7.7. A connected graph $G = (V, E)$ is called $(\epsilon, h)$-robust with distortion $(\sigma, \delta)$, for
some $\epsilon \in (0, 1]$, if the following holds for every $b$: either $G$ is $(b, h)$-rich with distortion $(\sigma, \delta)$, or
$T_{b,h}(E)$ contains at least $\epsilon n$ isolated nodes.

In the first case of this definition, we can use the $(b, h)$-EDP Pruning Algorithm safely, while in
the second case, we will show that $T_{b,h}(E_{\mathrm{sg}})$ is disconnected with high probability.

Notice that the toroidal grid is $(1, h)$-robust for any $h$. We give more examples of robust graphs
in Section 7.2.1.

Theorem 7.8. Consider a single-category social graph with near-uniform density and local structure
$E_{\mathrm{loc}}$. Suppose that for all $h \in H$, $E_{\mathrm{loc}}$ is $(\epsilon, h)$-robust with distortion $(\sigma, \delta)$ and (17) holds.
Then, when the Adaptive EDP algorithm is run with the candidate set $H$, it will obtain the
guarantees of Theorem 7.2 for the optimum pair $(b, h)$ among all $h \in H$.

Proof. The Adaptive EDP algorithm picks the pair $(b, h)$ with optimal $r_{\mathrm{EDP}}(\alpha)$ among all pairs
$(b, h)$, $h \in H$, such that the pruned graph $T_{b,h}(E_{\mathrm{sg}})$ is connected. By Lemma 7.6, with high
probability, this is the set of all pairs $(b, h)$, $h \in H$, such that the local structure $E_{\mathrm{loc}}$ is $(b, h)$-rich with
distortion $(\sigma, \delta)$. □

Footnote: Note that any graph $G$ in Definition 7.7 is a $(\sigma, \delta)$-spanner. This is because for $b = 1$ no edges are pruned, and
so $G$ must be $(1, h)$-rich with distortion $(\sigma, \delta)$, which in turn implies that it is a $(\sigma, \delta)$-spanner.
Proof of Lemma 7.6. Fix $(b, h)$ and let $T = T_{b,h}$. Let $I$ be a set of $\epsilon n$ isolated nodes in $T(E_{\mathrm{loc}})$.

The high-level idea of the proof is as follows. For each node $u \in I$ and any edge $(u, v) \in E_{\mathrm{loc}}$,
the local structure $E_{\mathrm{loc}}$ alone does not contain $b$ edge-disjoint $u$-$v$ paths of length at most $h$. Thus, for
$u$ not to be isolated in $T(E_{\mathrm{sg}})$, a small neighborhood of $u$ would have to be incident on at least
one random edge. Because there are at least $\epsilon n$ such isolated nodes $u$, we will be able to show
that with high probability, at least one of them will end up isolated in $T(E_{\mathrm{sg}})$. This is not trivial
as there is significant dependence between the isolation events for different nodes; we solve this
issue by considering a sufficiently spread-out subset $\mathcal{N}$ of $I$ (which limits the dependence), and
then applying Lemma 7.4 to a carefully designed revelation process. We now fill in the remaining
technical details.

For any set $S \subseteq V$, let $\mathcal{E}(S)$ denote the event that $E_{\mathrm{sg}}$ contains no random edges incident on
$S$. We begin by lower-bounding $\mathrm{Prob}[\mathcal{E}(u)]$ for individual nodes $u$. Fix $u$ and a distance scale
$r$, and let $U_r$ be the set of nodes $v$ with $\mathcal{D}(u, v) \in (r, 2r]$. There are at most $C_{\mathrm{UD}} \cdot (2r)^{\ell}$ nodes
in $U_r$, and for each node $v \in U_r$, an edge $(u, v)$ is created independently with probability at most
$q = C_{\mathrm{sg}} k_{\mathrm{sg}} r^{-\ell}$. Thus, the probability that $u$ has no edges to any nodes in $U_r$ is at least



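$$(1 - q)^{C_{\mathrm{UD}} (2r)^{\ell}} \;\ge\; 4^{-q \, C_{\mathrm{UD}} (2r)^{\ell}} \;=\; 4^{-2^{\ell} C_{\mathrm{UD}} C_{\mathrm{sg}} k_{\mathrm{sg}}}.$$

Here, we used the fact that the function $f(q) = (1 - q)^{1/q}$ is decreasing in $q$, so in particular
$f(q) \ge f(\tfrac{1}{2}) = \tfrac{1}{4}$ for any $q \le \tfrac{1}{2}$.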
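
The inequality used in this step, namely that $f(q) = (1-q)^{1/q}$ is decreasing with $f(1/2) = 1/4$, and hence $(1-q)^m \ge 4^{-qm}$ whenever $0 < q \le 1/2$, can be sanity-checked numerically. The following is a small illustrative sketch; the function names and the grid of test values are assumptions made for the example.

```python
def f(q):
    """f(q) = (1 - q)^(1/q), defined for 0 < q < 1."""
    return (1.0 - q) ** (1.0 / q)


def no_edge_lower_bound(q, m):
    """Lower bound 4^(-q*m) on (1 - q)^m, valid for 0 < q <= 1/2."""
    return 4.0 ** (-q * m)


# Evaluate f on a grid of q values in (0, 1/2].
grid = [k / 1000 for k in range(1, 501)]
values = [f(q) for q in grid]

# f is decreasing on the grid, so its minimum is attained at q = 1/2.
is_decreasing = all(values[i] >= values[i + 1] for i in range(len(values) - 1))
```

Here `is_decreasing` evaluates to `True`, and `f(0.5)` equals `0.25`, which is exactly the constant $4^{-1}$ appearing in the bound.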

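The event that $u$ has no random edges is now the intersection of the events that $u$ has no
random edges at scale $r$, with $r$ ranging over powers of 2. Thus, $\mathcal{E}(u)$ is the intersection of $\log(n)$
independent events, each with probability at least $4^{-2^{\ell} C_{\mathrm{UD}} C_{\mathrm{sg}} k_{\mathrm{sg}}}$. Thus, for each node $u$,

$$\mathrm{Prob}[\mathcal{E}(u)] \;\ge\; 4^{-2^{\ell} C_{\mathrm{UD}} C_{\mathrm{sg}} k_{\mathrm{sg}} \log n} \;=\; n^{-2 \cdot 2^{\ell} C_{\mathrm{UD}} C_{\mathrm{sg}} k_{\mathrm{sg}}}.$$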

For any node $u \in I$, let $V_u$ be the $(h-1)$-hop neighborhood of $u$ in $E_{\mathrm{loc}}$. Note that $V_u \subseteq B(u, \delta h)$,
so it contains at most $C_{\mathrm{UD}} (\delta h)^{\ell}$ nodes. We consider events $\mathcal{E}(V_u)$ that no node in $V_u$ is incident
on any random edges. The absence of any random edges incident on a subset $V' \subseteq V$ of nodes can
only increase the probability that no random edge is incident on a given node $u$, as there are fewer
remaining candidate edges. In this sense, the events $\{\mathcal{E}(v)\}_{v}$ are positively correlated, and we
can bound

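$$\mathrm{Prob}[\mathcal{E}(V_u)] = \mathrm{Prob}\Big[\bigcap_{v \in V_u} \mathcal{E}(v)\Big] \;\ge\; \prod_{v \in V_u} \mathrm{Prob}[\mathcal{E}(v)] \;\ge\; n^{-2 (2\delta h)^{\ell} C_{\mathrm{UD}}^{2} C_{\mathrm{sg}} k_{\mathrm{sg}}}. \tag{18}$$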

By the assumption (17), the above expression is at least $p = n^{-1/4}$.

We claim that whenever $\mathcal{E}(V_u)$ happens, the node $u \in I$ is isolated in $T(E_{\mathrm{sg}})$. First, note that
under the event $\mathcal{E}(V_u)$, $u$ itself has no incident random edges. Let $(u, v) \in E_{\mathrm{loc}}$ be arbitrary. We
show that $(u, v)$ must be pruned. Because no random edges are incident on $V_u$, no path in $T(E_{\mathrm{sg}})$
of length at most $h$ starting from $u$ can contain any random edge. Thus, all $u$-$v$ paths of length at
most $h$ in $T(E_{\mathrm{sg}})$ must be entirely in $E_{\mathrm{loc}}$. However, $(u, v) \notin T(E_{\mathrm{loc}})$, so $E_{\mathrm{loc}}$ does not contain $b$
edge-disjoint $u$-$v$ paths of length at most $h$. Hence, $(u, v) \notin T(E_{\mathrm{sg}})$.

It remains to show that with high probability, at least one of the events $\mathcal{E}(V_u)$, $u \in I$, happens.
To limit the dependence between the events under consideration, we focus on a subset $\mathcal{N} \subseteq I$:
let $\mathcal{N}$ be a $2Ch$-net for $(I, \mathcal{D})$. Because there are at most $O((Ch)^{\ell})$ nodes within distance $2Ch$
of any node $u$, we obtain that $|\mathcal{N}| \ge \epsilon n / O((Ch)^{\ell})$. Furthermore, because $E_{\mathrm{loc}}$ has distortion at
most $C$, we get that $V_u \subseteq B(u, C(h-1))$, implying that the neighborhoods $V_u$, $u \in \mathcal{N}$, are pairwise
disjoint.

Footnote: Recall that an $r$-net for a metric space $(V, \mathcal{D})$ is a set of points $\mathcal{N} \subseteq V$ such that (i) any two points in $\mathcal{N}$ are at
distance at least $r$ from one another, and (ii) any point in $V$ is within distance at most $r$ from some point in $\mathcal{N}$. It
is a well-known fact that such sets exist and can be constructed greedily by adding one point at a time.
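
The greedy construction of an $r$-net mentioned in the footnote can be sketched as follows. This is a minimal illustration under the assumption that the metric is given as a Python callable `dist`; the function name `greedy_net` and the example point set are hypothetical.

```python
import math


def greedy_net(points, r, dist):
    """Greedy r-net: scan the points once and keep a point whenever it is at
    distance at least r from every point kept so far.

    Property (i): kept points are pairwise at distance >= r by construction.
    Property (ii): a point that was not kept is at distance < r from some
    kept point, so every point is within distance r of the net."""
    net = []
    for p in points:
        if all(dist(p, q) >= r for q in net):
            net.append(p)
    return net


# Example: a 10 x 10 integer grid under the Euclidean metric.
points = [(x, y) for x in range(10) for y in range(10)]
net = greedy_net(points, 3.0, math.dist)
```

The resulting set depends on the scan order, but both net properties hold for any order.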

The events $\mathcal{E}(V_u)$, $u \in \mathcal{N}$, are still not independent, but their dependence is now more limited,
making them amenable to the technique of Lemma 7.4. We define an ordering for revealing the
presence (or absence) of edges $(u, v)$, along with a revelation of the events $\mathcal{E}(V_u)$, $u \in \mathcal{N}$. Fix some
ordering $\varphi$ on $\mathcal{N}$, and start with $R = \mathcal{N}$. Throughout, $R$ will be a set of candidate nodes $u$ such
that the event $\mathcal{E}(V_u)$ has not been ruled out. In step $t = 1, 2, \ldots$, if $R \neq \emptyset$, let $u_t \in R$ be the first
remaining node in $R$ according to $\varphi$. Reveal the presence or absence of all random edges incident
on $V_{u_t}$ which have not been revealed yet. Whenever a random edge $(v, v')$ is revealed to be present
such that $v \in V_{u_t}$, $v' \in V_{u'}$ for some $u' \in R$, remove $u'$ from $R$. (In this case, $\mathcal{E}(V_{u'})$ clearly cannot
happen any more.) Once $R$ is empty, reveal the presence or absence of all remaining random edges.
Clearly, this is an equivalent way of revealing the random edge set $E_{\mathrm{sg}}$.

Consider a particular step $t$, during which a node $u_t \in R$ is processed. If no edges incident on
$V_{u_t}$ are revealed to be present, the event $\mathcal{E}(V_{u_t})$ has happened, and $T(E_{\mathrm{sg}})$ will be disconnected. Conditioned on
processing node $u_t$, the event $\mathcal{E}(V_{u_t})$ happens with probability at least $p = n^{-1/4}$, as the absence
of some edges incident on $V_{u_t}$ may already have been revealed earlier, whereas no edges can have
been revealed as present. (Otherwise, $u_t$ would have been removed from $R$.)

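Let $N$ be the number of steps $t$ of the revelation process, and let $X_t$ be the indicator variable
of the event $\mathcal{E}(V_{u_t})$. Thus, whenever each $V_u$, $u \in \mathcal{N}$, has an incident random edge, we have that
$\sum_{t=1}^{N} X_t = 0$. It thus suffices to upper-bound the probability that $\sum_{t=1}^{N} X_t = 0$, which can be
accomplished using the lower bound of Lemma 7.4 with $x = 0$:

$$\mathrm{Prob}\Big[\sum_{i=1}^{N} X_i \ge 1\Big] \;\ge\; \big(1 - (1-p)^{t}\big) - \mathrm{Prob}[N < t], \quad \text{for all } t \in \mathbb{N},$$

or equivalently,

$$\mathrm{Prob}\Big[\sum_{i=1}^{N} X_i = 0\Big] \;\le\; (1-p)^{t} + \mathrm{Prob}[N < t], \quad \text{for all } t \in \mathbb{N}. \tag{19}$$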
We choose $t = \epsilon \sqrt{n}$. Then,

$$(1-p)^{t} \;\le\; e^{-pt} \;=\; e^{-\epsilon n^{1/4}},$$

so $(1-p)^{t}$ is exponentially small. Finally, we bound the probability that $N < \epsilon \sqrt{n}$. Consider a step
$t$ of the revelation process. With high probability, each node in $V_{u_t}$ has at most $O(\log n)$ incident
random edges, so that the total number of random edges incident on $V_{u_t}$ is at most $\Theta(k_{\mathrm{loc}} \log n)$.
Thus, with high probability, at most $O(k_{\mathrm{loc}} \log n)$ other nodes $u$ can be removed from $R$ in any one
step, implying that the process will take at least

$$\frac{|\mathcal{N}|}{O(k_{\mathrm{loc}} \log n)} \;\ge\; \epsilon \sqrt{n}$$

steps, for sufficiently large $n$. In particular, $N \ge t$ with high probability, completing the proof. □
7.2.1 Examples of robust graphs 

Recall that the toroidal grid is $(1, h)$-robust for any $h$. The grid example extends to graphs that
are edge-transitive on a small scale, in the following sense.
Definition 7.9. Fix a graph $G$ and a path length $h$. For any edge $e$, let $H_e$ be the induced
subgraph of the $h$-hop neighborhood of $e$ in $G$. Two edges $e, e'$ are locally $h$-equivalent if there
exists an isomorphism $\varphi_{e,e'} : H_e \to H_{e'}$ with $\varphi_{e,e'}(e) = e'$. $G$ is called edge-transitive at scale $h$ if any
two edges are locally $h$-equivalent.

Notice that the traditional definition of edge-transitive graphs is obtained when h equals the 
graph's diameter. 

Let $G$ be an edge-transitive graph at scale $h$ that is a $(\sigma, \delta)$-spanner for $\mathcal{D}$. It is easy to see
that $G$ is $(1, h)$-robust with distortion $(\sigma, \delta)$. Indeed, the $h$-hop neighborhood of a given edge $(u, v)$
determines whether this edge is $(b, h)$-connected (i.e., whether there exist $b$ edge-disjoint $u$-$v$ paths
of at most $h$ hops each). So for every given $b$, either every edge in $G$ is $(b, h)$-connected, or no
edge in $G$ is $(b, h)$-connected, and all edges are therefore pruned by the $(b, h)$-EDP Pruning Algorithm.
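
The all-or-nothing behavior described here can be observed directly on small toroidal grids. The sketch below is an illustration restricted to hop count $h = 2$, where the number of edge-disjoint paths of at most two hops between the endpoints of an edge is one (the edge itself) plus the number of common neighbors; the function names are assumptions made for the example.

```python
def torus_edges(m):
    """Edges of the m x m toroidal grid (m >= 3), each listed once."""
    edges = []
    for x in range(m):
        for y in range(m):
            edges.append(((x, y), ((x + 1) % m, y)))
            edges.append(((x, y), (x, (y + 1) % m)))
    return edges


def two_hop_path_counts(edges):
    """For each edge (u, v): the number of edge-disjoint u-v paths of at
    most 2 hops, i.e. the edge itself plus one path per common neighbor."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {(u, v): 1 + len(adj[u] & adj[v]) for u, v in edges}


counts = two_hop_path_counts(torus_edges(5))
```

On the $5 \times 5$ torus every edge has the same count, so for any given $b$ the $(b, 2)$-pruning either keeps all edges or removes all of them; in the latter case all nodes are isolated.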

We further generalize this example to graphs $G$ with some short edges added. Specifically, pick
an arbitrary node set $S \subseteq V$ such that its $(h+1)$-neighborhood in $G$ contains at most $(1 - \epsilon) n$ nodes,
for some $\epsilon \in (0, 1)$. Add arbitrary edges $(u, v)$ such that $u, v \in S$ and $\mathcal{D}(u, v) \le \delta$. Note that the
resulting graph $G'$ is also a $(\sigma, \delta)$-spanner for $\mathcal{D}$.

We claim that $G'$ is $(\epsilon, h)$-robust with distortion $(\sigma, \delta)$. Indeed, if $G$ is $(b, h)$-connected for some
$b$, then $G'$ is $(b, h)$-rich with distortion $(\sigma, \delta)$ and connectivity witness $G$. Otherwise, no edge in
$G$ is $(b, h)$-connected in $G$ alone. Consider the complement $S'$ of the $(h+1)$-neighborhood of $S$.
Any edge $e$ in $G'$ with at least one endpoint in $S'$ is also present in $G$, and moreover has the same
$h$-neighborhood in both graphs. It follows that $e$ is not $(b, h)$-connected in $G'$; consequently, it is
pruned by the $(b, h)$-EDP Pruning Algorithm. Therefore every node in $S'$ is isolated in $T_{b,h}(G')$.



8 Category Disjointness and Random Permutations 

Recall that our motivation for the definition of the Local Category-Disjointness and Scale-$R$
Category-Disjointness conditions was that they intuitively capture the notion of categories looking
random with respect to one another "locally." In this section, we confirm the intuition guiding
the definition, by showing that both conditions are satisfied with high probability when the metric
space for each category $i$ is randomly permuted, in the sense that $\mathcal{D}_i(u, v) = \mathcal{D}'_i(\sigma_i(u), \sigma_i(v))$ for
some "base metric" $\mathcal{D}'_i$ and a random permutation $\sigma_i$ on the node set. Accordingly, both conditions
are indeed significantly weaker (in particular, more local) than requiring that metrics be randomly
permuted.

Lemma 8.1. Consider a multi-category social graph with near-uniform density. For each category
$i$, let $\mathcal{D}'_i$ be a "base" metric, and $\sigma_i$ a uniformly random permutation of the node set. (The
permutations for different metrics are pairwise independent.) For each node pair $(u, v)$, the category-$i$
distance is $\mathcal{D}_i(u, v) = \mathcal{D}'_i(\sigma_i(u), \sigma_i(v))$. Then, with high probability, the Local Category-Disjointness
and Scale-$\infty$ Category-Disjointness conditions are satisfied.

Proof. Our proof uses an extension of Chernoff Bounds to dependent random variables in which
the randomness comes from a random permutation (Theorem 8.2, stated and proved below).

We begin by proving that the Local Category-Disjointness condition is satisfied. Fix two
categories $i \neq i'$. Consider balls $B, B'$ of radii $r, r' = \mathrm{polylog}(n)$ in categories $i, i'$, respectively. Note
that $\mathbb{E}[|B \cap B'|] = O((r r')^{\ell} / n) < 1$.

Footnote: Therefore Scale-$R$ Category-Disjointness is satisfied for any $R$.
Define a mapping from category $i$ to category $i'$ by $\sigma(u) = \sigma_{i'}^{-1}(\sigma_i(u))$, $\sigma : V \to V$. $\sigma(u)$ captures
at what point of the metric space $\mathcal{D}_{i'}$ a node in the metric space $\mathcal{D}_i$ ends up. Because $\sigma_i, \sigma_{i'}$ were
independent uniform permutations on $V$, so is $\sigma$. We will consider nodes $u \in B$, which we capture
by setting $\alpha_u = 1_{\{u \in B\}}$. Such a node is also in $B'$ iff $\sigma(u) \in B'$. Thus, defining $X_u = 1_{\{\sigma(u) \in B'\}}$,
we get that $|B \cap B'| = \sum_{u \in V} \alpha_u X_u$, and by Theorem 8.2, this sum is at most $O(\log n)$ with high
probability.
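
The expectation of the overlap under a random permutation can be illustrated by a short simulation. This is only a sketch with hypothetical parameters: nodes are the integers $0, \ldots, n-1$, the set $B$ is taken to be the first `size_b` nodes, and a node $u$ counts as lying in $B'$ when its image under a uniformly random permutation falls among the first `size_b2` positions; the expected overlap is then `size_b * size_b2 / n`.

```python
import random


def mean_overlap(n, size_b, size_b2, trials, seed=0):
    """Average |B ∩ B'| over independent uniformly random permutations.

    B is the fixed set {0, ..., size_b - 1}; a node u belongs to B' when
    perm[u] < size_b2, which mirrors the condition sigma(u) in B'."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        total += sum(1 for u in range(size_b) if perm[u] < size_b2)
    return total / trials


# With n = 1000 and |B| = |B'| = 30 the expected overlap is 30 * 30 / 1000 = 0.9.
estimate = mean_overlap(1000, 30, 30, trials=2000, seed=1)
```

The estimate concentrates around the expected value, in line with the observation that small balls in independently permuted categories have little overlap.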

Next, we prove that the Scale-$\infty$ Category-Disjointness condition holds as well. Fix a category
$j$, distance scale $r > 0$, and two disjoint sets $B, B' \subseteq V$, $|B'| \ge |B|$ (which will be balls in category
$i$). Define the random variable $f(B, B') = \sum_{v \in B,\, v' \in B'} 1_{\{\mathcal{D}_j(v, v') \le r\}}$ to be the number of node pairs
at category-$j$ distance at most $r$.

We will prove a high-probability bound on $f(B, B')$ conditioned on the choice of all permutations
$\sigma_i$ for $i \neq j$. In other words, we consider the probability space induced by the random choice of
$\sigma = \sigma_j$. We will prove that with high probability,

$$f(B, B') = O(r^{\ell}/n) \cdot |B| \, |B'| + O(\log^2 n). \tag{20}$$

Then the Scale-$\infty$ Category-Disjointness condition follows by taking a Union Bound over all
categories $i, j$, all pairs of balls $B, B'$ in category $i$, and all distinct distances $r$ in category $j$.

We begin by calculating the expectation of $f(B, B')$ using linearity of expectation. Notice that
$\mathbb{E}[1_{\{\mathcal{D}_j(v, v') \le r\}}] = \mathrm{Prob}[\mathcal{D}_j(v, v') \le r]$ is the probability that $v'$ is mapped to a node in a ball
around $v$ of radius $r$. Since there are $\Theta(r^{\ell})$ nodes in the ball around $v$ of radius $r$ (wherever $v$ itself
is mapped), we get that $\mathbb{E}[1_{\{\mathcal{D}_j(v, v') \le r\}}] = \Theta(r^{\ell}/n)$, and $\mathbb{E}[f(B, B')] = \Theta(r^{\ell}/n) \cdot |B| \, |B'|$.

It remains to prove that $f(B, B')$ is concentrated around its expectation. To this end, we will use
Theorem 8.2 twice. First, focus on an arbitrary node $v'$ and consider $f(B, \{v'\})$. We have that
$\mathbb{E}[f(B, \{v'\})] = \Theta(r^{\ell}/n) \cdot |B|$. We can reveal the randomness of $\sigma$ by first revealing $\sigma(v')$, which
defines a set $U = \{u \in V \mid \mathcal{D}'_j(u, \sigma(v')) \le r\}$. Then,

$$f(B, \{v'\}) = \sum_{v \in V} \alpha_v 1_{\{\sigma(v) \in U\}},$$

where $\alpha_v = 1$ if $v \in B$, and $\alpha_v = 0$ otherwise. Thus, Theorem 8.2 implies concentration of
$f(B, \{v'\})$ for any $v'$, and gives us that with high probability, $f(B, \{v'\}) = O(\max(\log n, \frac{r^{\ell}}{n} \cdot |B|))$
for all $v'$. Let $N := \Theta(\max(\log n, \frac{r^{\ell}}{n} \cdot |B|))$ denote this high-probability bound.

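Next, our goal is to sum over all $v' \in B'$. First, reveal $\sigma(v)$ for all $v \in B$, and condition on this
choice, writing $T = \{\sigma(v) \mid v \in B\}$. Then, $\sigma$ is defined by a uniformly random permutation from
$V \setminus B$ to $V \setminus T$, or, equivalently, by a uniformly random permutation from $V \setminus T$ to $V \setminus B$.
For each $u' \in V \setminus T$, let $\beta_{u'} = \sum_{u \in T} 1_{\{\mathcal{D}'_j(u, u') \le r\}}$ be the number of nearby locations to which nodes
in $B$ were mapped. Then, we can write

$$f(B, B') = \sum_{u' \in V \setminus T} \beta_{u'} \cdot 1_{\{\sigma^{-1}(u') \in B'\}} = N \cdot \sum_{u' \in V \setminus T} \frac{\beta_{u'}}{N} \cdot 1_{\{\sigma^{-1}(u') \in B'\}}.$$

Defining $\alpha_{u'} = \min(1, \beta_{u'}/N) \in [0, 1]$, we get that with high probability (in the high-probability event
that $f(B, \{v'\}) \le N$ for all $v'$),

$$f(B, B') \le N \cdot \sum_{u' \in V \setminus T} \alpha_{u'} \, 1_{\{\sigma^{-1}(u') \in B'\}},$$

and $\sigma^{-1}$ is a uniformly random permutation. By Theorem 8.2, with high probability,

$$f(B, B') \le N \cdot O\Big(\mathbb{E}\Big[\sum_{u' \in V \setminus T} \alpha_{u'} \, 1_{\{\sigma^{-1}(u') \in B'\}}\Big] + \log n\Big) = O\big(\mathbb{E}[f(B, B')] + N \log n\big).$$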

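If $N = \Theta(\log n)$, this bound is obviously $O(\mathbb{E}[f(B, B')] + \log^2 n)$. Otherwise, $N = \Theta(\frac{r^{\ell}}{n} \cdot |B|)$, and
$\frac{r^{\ell}}{n} \cdot |B| = \Omega(\log n)$, which implies (because $r^{\ell} \le n$) that $|B| = \Omega(\log n)$. And because we assumed
that $|B'| \ge |B|$, we get that

$$\mathbb{E}[f(B, B')] = \Theta(r^{\ell}/n) \cdot |B| \, |B'| \;\ge\; \Theta(r^{\ell}/n) \cdot |B| \log n \;\ge\; \Theta(N \log n),$$

so that the $N \log n$ term is subsumed in the $\mathbb{E}[f(B, B')]$ term. This completes the proof of the
lemma. □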

Theorem 8.2 (Chernoff Bounds for permutations). Fix $n \in \mathbb{N}$ and a subset $I \subseteq \{1, \ldots, n\}$. Let
$\sigma$ be a uniformly random permutation of $\{1, \ldots, n\}$. For each $i \in \{1, \ldots, n\}$, fix $\alpha_i \in [0, 1]$ and let
$X_i = 1_{\{\sigma(i) \in I\}}$. Then $X = \sum_{i=1}^{n} \alpha_i X_i$ satisfies both conditions from Theorem 3.1:

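$$\mathrm{Prob}[X < (1 - \delta)\mu] < \exp(-\mu \delta^2 / 3), \quad \text{for any } \delta > 0,$$
$$\mathrm{Prob}[X > (1 + \delta)\mu] < \exp(-\mu \delta^2 / 3), \quad \text{for any } \delta \in (0, 1).$$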
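
As an informal illustration of the upper-tail bound, one can compare the empirical frequency of the event $X > (1+\delta)\mu$ against $\exp(-\mu \delta^2/3)$ by simulation. The sketch below assumes that $\mu$ denotes the expectation of $X$, which for a uniformly random permutation equals $(|I|/n) \sum_i \alpha_i$; the function names and the example weights are hypothetical, and a simulation of this kind is of course no substitute for the proof.

```python
import math
import random


def permutation_sum(alpha, subset, rng):
    """Draw X = sum_i alpha_i * 1{sigma(i) in subset} for a uniformly random
    permutation sigma of {0, ..., n-1}."""
    perm = list(range(len(alpha)))
    rng.shuffle(perm)
    return sum(a for i, a in enumerate(alpha) if perm[i] in subset)


def empirical_upper_tail(alpha, subset, delta, trials, seed=0):
    """Return (empirical frequency of X > (1 + delta) * mu, Chernoff bound),
    where mu = E[X] = (|subset| / n) * sum(alpha) is assumed."""
    rng = random.Random(seed)
    mu = sum(alpha) * len(subset) / len(alpha)
    hits = sum(
        permutation_sum(alpha, subset, rng) > (1 + delta) * mu
        for _ in range(trials)
    )
    return hits / trials, math.exp(-mu * delta ** 2 / 3)


# Example: n = 200, weights in {1/4, 1/2, 3/4, 1}, |I| = 100, so mu = 62.5.
alpha = [(i % 4 + 1) / 4 for i in range(200)]
freq, bound = empirical_upper_tail(alpha, set(range(100)), delta=0.1, trials=500, seed=2)
```

In this example the empirical frequency stays below the bound, as the theorem predicts.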

While the result appears standard, we are not aware of a published proof, so for completeness
we provide a self-contained proof. The proof uses Chernoff Bounds for negatively associated random
variables (see, e.g., [17]). We summarize the relevant result in the following theorem:

Theorem 8.3 ([17, pages 34-35 and Problem 3.1]). Let $X_1, \ldots, X_n$ be random variables jointly
distributed on $[0, 1]^n$ such that $\sum_{i=1}^{n} X_i$ is constant. For any subset $I \subseteq \{1, \ldots, n\}$, write $S_I =
\sum_{i \in I} X_i$. Assume that the following hold for any such subset:

• Any $X_i$ with $i \in I$ is conditionally independent of the $X_j$ with $j \notin I$ given $S_I$.

• For any coordinate-wise non-decreasing function $f : \mathbb{R}^{|I|} \to \mathbb{R}$, the conditional expectation
$\mathbb{E}[f(X_i, i \in I) \mid S_I = t]$ is non-decreasing as a function of $t \in \mathbb{R}$.

Then, the random variables $X_1, \ldots, X_n$ are said to be negatively associated. In particular, it
follows that $X = \sum_{i=1}^{n} \alpha_i X_i$ satisfies the bounds from Theorem 8.2 for any fixed $\alpha_1, \ldots, \alpha_n \in [0, 1]$.

Proof of Theorem 8.2. First note that by definition, $\sum_{i=1}^{n} X_i = |I|$ is constant. Thus, it suffices
to verify that the random variables $X_i$ are negatively associated.

Fix $I \subseteq \{1, \ldots, n\}$. For each $t \in \mathbb{N}$, let $\mathcal{F}_t$ be the set of all tuples $(x_i, i \in I)$ such that $x_i \in \{0, 1\}$
and $\sum_{i \in I} x_i = t$, and let $U_t$ be the uniform distribution over $\mathcal{F}_t$.

To establish the first property of negative association, simply note that the conditional
distribution of $(X_i, i \in I)$ given $S_I = t$ and any assignment for $(X_i, i \notin I)$ is $U_t$, so independence is
established.

For the second property, fix a coordinate-wise non-decreasing function $f : \mathbb{R}^{|I|} \to \mathbb{R}$. Since the
conditional distribution of $(X_i, i \in I)$ given $\{S_I = t\}$ is $U_t$, we have that

$$g(t) \triangleq \mathbb{E}[f(X_i, i \in I) \mid S_I = t] = \mathbb{E}_{x \sim U_t}[f(x)].$$

We need to show that $g(t+1) \ge g(t)$. We couple selections according to $U_t$ and $U_{t+1}$ as follows.

• Pick $x \sim U_t$.

• Pick $j$ uniformly at random from $\{i \in I \mid x_i = 0\}$.

• Set $y_j = 1$, and $y_i = x_i$ for all $i \neq j$.

Notice that $y \sim U_{t+1}$. By monotonicity of $f$, we have that $f(y) \ge f(x)$. It follows that

$$g(t+1) = \mathbb{E}_{y \sim U_{t+1}}[f(y)] \ge \mathbb{E}_{x \sim U_t}[f(x)] = g(t).$$

The claim now follows from applying the result for negatively associated random variables. □
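
The coupling used in this proof can be verified exactly on small instances by enumeration. The sketch below is illustrative only: it computes the exact law of $y$ obtained by drawing $x$ uniformly from the 0/1 tuples of length $m$ with $t$ ones and then switching a uniformly random zero to one, and it evaluates $g(t)$ for a monotone test function. The names `level_set`, `coupled_distribution` and `g`, as well as the test function in the usage note, are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations


def level_set(m, t):
    """All 0/1 tuples of length m with exactly t ones."""
    return [
        tuple(1 if i in ones else 0 for i in range(m))
        for ones in map(set, combinations(range(m), t))
    ]


def coupled_distribution(m, t):
    """Exact law of y: x is uniform on level t (t < m), then a uniformly
    random zero coordinate of x is set to 1."""
    xs = level_set(m, t)
    law = {}
    for x in xs:
        zeros = [i for i in range(m) if x[i] == 0]
        for j in zeros:
            y = x[:j] + (1,) + x[j + 1:]
            weight = Fraction(1, len(xs) * len(zeros))
            law[y] = law.get(y, Fraction(0)) + weight
    return law


def g(f, m, t):
    """g(t) = E[f(x)] for x uniform over the 0/1 tuples with t ones."""
    xs = level_set(m, t)
    return sum(Fraction(f(x)) for x in xs) / len(xs)
```

For $m = 5$ and $t = 2$, `coupled_distribution(5, 2)` assigns probability exactly $1/10$ to each of the ten tuples with three ones, confirming that $y$ is uniform on the next level; for a coordinate-wise non-decreasing function such as `lambda x: max(x[0], x[1] * x[2]) + x[4]`, the values `g(f, 5, t)` are non-decreasing in `t`.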

9 Conclusions 

We have shown that, under standard assumptions about generative models for social networks, 
it is possible to reconstruct social spaces with small distortion from a multiplex social network; 
indeed, it is possible to do so in near-linear time. The edges do not need to be labeled with 
their "origin," so long as the categories are "locally sufficiently uncorrelated." Under increasingly
stronger assumptions, the distortion can be improved from constant, to 1 + o(1), to poly-logarithmic
additive error. While these results rely on having poly-logarithmic node degree, we also show that 
small polynomial distortion can be obtained in the constant-degree regime, so long as the social 
network contains a sufficiently rich local structure. This is possible even if the algorithm only 
possesses very rudimentary knowledge about the local structure. 

While our results can be interpreted as a proof of concept — it is possible in principle to 
efficiently separate the different dimensions of social interactions and identify similarities between 
individuals — they set the stage for a number of possible extensions. 

1. There are several specific technical open questions within our model, the most immediate of 
which is extending the multi-category results to the constant-degree regime. 

2. We assumed that the algorithm had knowledge of various input parameters (the number of 
categories, the number of dimensions, etc.), whereas ideally, the algorithm should be able to 
learn these parameters from input data as well. 

3. For our multi-category algorithms to work, we required a "category disjointness" condition, 
essentially stating that locally, metrics look uncorrelated with respect to each other. It seems 
unlikely that one could reconstruct metrics if categories were extremely similar, but it is 
an interesting open question how much our current condition could be weakened while still 
allowing for provable reconstruction. In particular, we conjecture that future work will be 
able to deal with a few localized violations of the category disjointness condition, so that they 
lead to incorrect distance estimates only for the affected nodes, without propagating to other 
parts of the metric space. 

4. Our model so far also assumes that the node degrees are essentially uniform across nodes, 
which will usually not hold in practice. A corresponding extension for the single-category case 
might not be too difficult, but inferring the individual node degrees for multiple categories 
appears more difficult. 

5. Finally, and perhaps most importantly, one may want to consider "host spaces" other than
Euclidean space with near-uniform density, such as ultrametrics, more general "group structures"
(e.g., dU), or point sets with significantly non-uniform density. It would be particularly
interesting if an algorithm did not need to know the structure of the host space in advance,
and instead could infer it from the data.

In practice, there will usually be additional information available beyond the edges. This 
may include information about nodes' locations, interests, or demographics (as collected by social 
networking sites); partial interaction statistics along the edges; or perhaps a social network that 
has been previously embedded in a social distance space, but is now being extended by a few new 
nodes. In any of these cases, it is an interesting question how to formalize the benefits that can be obtained
with such side information. In particular, time stamps on edges introduce a temporal dimension 
into the problem: now, instead of fixed node locations in the social space, one could ask about 
nodes' trajectories over time. 